6.2. Background: review of differentiable functions of several variables and introduction to automatic differentiation#

We review the differential calculus of several variables. We highlight a few key results that will play an important role: the Chain Rule and the Mean Value Theorem. We also give a brief introduction to automatic differentiation.

6.2.1. Gradient#

Recall that a point \(\mathbf{x} \in \mathbb{R}^d\) is an interior point of a set \(A \subseteq \mathbb{R}^d\) if there exists an \(r > 0\) such that \(B_r(\mathbf{x}) \subseteq A\), where the open \(r\)-ball around \(\mathbf{x} \in \mathbb{R}^d\) is the set of points within Euclidean distance \(r\) of \(\mathbf{x}\), that is,

\[ B_r(\mathbf{x}) = \{\mathbf{y} \in \mathbb{R}^d \,:\, \|\mathbf{y} - \mathbf{x}\| < r\}. \]

Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable.

DEFINITION (Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}\) and let \(x_0 \in D\) be an interior point of \(D\). The derivative of \(f\) at \(x_0\) is

\[ f'(x_0) = \frac{\mathrm{d} f (x_0)}{\mathrm{d} x} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} \]

provided the limit exists. \(\natural\)

For functions of several variables, we have the following generalization. We let \(\mathbf{e}_i \in \mathbb{R}^d\) be the \(i\)-th standard basis vector.

DEFINITION (Partial Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 = (x_{0,1},\ldots,x_{0,d}) \in D\) be an interior point of \(D\). The partial derivative of \(f\) at \(\mathbf{x}_0\) with respect to \(x_i\) is

\[\begin{align*} \frac{\partial f (\mathbf{x}_0)}{\partial x_i} &= \lim_{h \to 0} \frac{f(\mathbf{x}_0 + h \mathbf{e}_i) - f(\mathbf{x}_0)}{h}\\ &= \lim_{h \to 0} \frac{f(x_{0,1},\ldots,x_{0,i-1},x_{0,i} + h,x_{0,i+1},\ldots,x_{0,d}) - f(x_{0,1},\ldots,x_{0,d})}{h} \end{align*}\]

provided the limit exists. If \(\frac{\partial f (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i\), then we say that \(f\) is continuously differentiable at \(\mathbf{x}_0\). \(\natural\)

DEFINITION (Gradient) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume \(f\) is continuously differentiable at \(\mathbf{x}_0\). The (column) vector

\[ \nabla f(\mathbf{x}_0) = \left(\frac{\partial f (\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f (\mathbf{x}_0)}{\partial x_d}\right) \]

is called the gradient of \(f\) at \(\mathbf{x}_0\). \(\natural\)

Note that the gradient is itself a function of \(\mathbf{x}\). In fact, unlike \(f\), it is a vector-valued function.

Figure: Gradient as a function (Source)

\(\bowtie\)
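As a numerical companion to the definition (a sketch, not part of the formal development), partial derivatives can be approximated by central finite differences, one coordinate at a time. The helper `grad_fd` and the example function `f` below are illustrative names, not from the text.

```python
import numpy as np

def grad_fd(f, x, h=1e-6):
    """Approximate the gradient of f at x by central finite differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        # Each coordinate of the gradient is a one-variable derivative.
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: f(x) = x1^2 + 3 x2 has gradient (2 x1, 3).
f = lambda x: x[0]**2 + 3 * x[1]
print(grad_fd(f, [1.0, 2.0]))  # approximately [2., 3.]
```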

EXAMPLE: Consider the affine function

\[ f(\mathbf{x}) = \mathbf{q}^T \mathbf{x} + r \]

where \(\mathbf{x} = (x_1, \ldots, x_d)^T, \mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d\). The partial derivatives of the linear term are given by

\[ \frac{\partial}{\partial x_i} [\mathbf{q}^T \mathbf{x}] = \frac{\partial}{\partial x_i} \left[\sum_{j=1}^d q_j x_j \right] = \frac{\partial}{\partial x_i} \left[q_i x_i \right] = q_i. \]

So the gradient of \(f\) is

\[ \nabla f(\mathbf{x}) = \mathbf{q}. \]

\(\lhd\)
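The formula \(\nabla f(\mathbf{x}) = \mathbf{q}\) can be sanity-checked numerically: for an affine function, finite differences recover the same constant vector \(\mathbf{q}\) at any point. A minimal sketch, assuming NumPy is available (the names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
r = 1.5
f = lambda x: q @ x + r   # affine function q^T x + r

# Central finite differences at an arbitrary point should recover q.
x = rng.standard_normal(4)
h = 1e-6
grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                 for e in np.eye(4)])
print(np.allclose(grad, q, atol=1e-5))  # True
```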

EXAMPLE: Consider the quadratic function

\[ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r. \]

where \(\mathbf{x} = (x_1, \ldots, x_d)^T, \mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d\) and \(P \in \mathbb{R}^{d \times d}\). The partial derivatives of the quadratic term are given by

\[\begin{align*} \frac{\partial}{\partial x_i} [\mathbf{x}^T P \mathbf{x}] &= \frac{\partial}{\partial x_i} \left[\sum_{j, k=1}^d P_{jk} x_j x_k \right]\\ &= \frac{\partial}{\partial x_i} \left[P_{ii} x_i^2 + \sum_{j=1, j\neq i}^d P_{ji} x_j x_i + \sum_{k=1, k\neq i}^d P_{ik} x_i x_k \right], \end{align*}\]

where we used that all terms not including \(x_i\) have partial derivative \(0\).

This last expression is

\[\begin{align*} &= 2 P_{ii} x_i + \sum_{j=1, j\neq i}^d P_{ji} x_j + \sum_{k=1, k\neq i}^d P_{ik} x_k\\ &= \sum_{j=1}^d [P^T]_{ij} x_j + \sum_{k=1}^d [P]_{ik} x_k\\ &= ([P + P^T]\mathbf{x})_i. \end{align*}\]

So the gradient of \(f\) is

\[ \nabla f(\mathbf{x}) = \frac{1}{2}[P + P^T] \,\mathbf{x} + \mathbf{q}. \]

If \(P\) is symmetric, this further simplifies to \(\nabla f(\mathbf{x}) = P \,\mathbf{x} + \mathbf{q}\). \(\lhd\)

It will be useful to compute the derivative of a function \(f(\mathbf{x})\) of several variables along a parametric curve \(\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d\) for \(t\) in some closed interval of \(\mathbb{R}\). The following result is a special case of an important fact. We will use the following notation

\[\begin{split} \mathbf{g}'(t) = \begin{pmatrix} g_1'(t)\\ \vdots\\ g_d'(t) \end{pmatrix}. \end{split}\]

We say that \(\mathbf{g}(t)\) is continuously differentiable at \(t = t_0\) if each of its components is.

EXAMPLE: (Parametric Line) The straight line between \(\mathbf{x}_0 = (x_{0,1},\ldots,x_{0,d})^T\) and \(\mathbf{x}_1 = (x_{1,1},\ldots,x_{1,d})^T\) in \(\mathbb{R}^d\) can be parametrized as

\[ \mathbf{g}(t) = \mathbf{x}_0 + t (\mathbf{x}_1 - \mathbf{x}_0), \]

where \(t\) goes from \(0\) (at which \(\mathbf{g}(0) = \mathbf{x}_0\)) to \(1\) (at which \(\mathbf{g}(1) = \mathbf{x}_1\)).

Then

\[ g_i'(t) = \frac{\mathrm{d}}{\mathrm{d}t} [x_{0,i} + t (x_{1,i} - x_{0,i})] = x_{1,i} - x_{0,i}, \]

so that

\[ \mathbf{g}'(t) = \mathbf{x}_1 - \mathbf{x}_0. \]

\(\lhd\)

Before stating the Chain Rule, first recall the Mean Value Theorem in the single-variable case, which we will use in the proof.

THEOREM (Mean Value) Let \(f : [a,b] \to \mathbb{R}\) be a continuous function and assume that its derivative exists on \((a,b)\). Then there is \(a < c < b\) such that

\[ f(b) = f(a) + (b-a)f'(c). \]

\(\sharp\)

Recall the Chain Rule in the single-variable case. Quoting Wikipedia:

The simplest form of the chain rule is for real-valued functions of one real variable. It states that if \(g\) is a function that is differentiable at a point \(c\) (i.e. the derivative \(g'(c)\) exists) and f is a function that is differentiable at \(g(c)\), then the composite function \({\displaystyle f\circ g}\) is differentiable at \(c\), and the derivative is \({\displaystyle (f\circ g)'(c)=f'(g(c))\cdot g'(c)}\).

Here is a straightforward generalization of the Chain Rule.

THEOREM (Chain Rule) Let \(f : D_1 \to \mathbb{R}\), where \(D_1 \subseteq \mathbb{R}\), and let \(g : D_2 \to \mathbb{R}\), where \(D_2 \subseteq \mathbb{R}^d\). Assume that \(f\) is continuously differentiable at \(g(\mathbf{x}_0)\), an interior point of \(D_1\), and that \(g\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_2\). Then

\[ \nabla (f\circ g) (\mathbf{x}_0) = f'(g(\mathbf{x}_0)) \nabla g(\mathbf{x}_0). \]

\(\sharp\)

Proof: We apply the Chain Rule for functions of one variable to the partial derivatives. For all \(i\),

\[ \frac{\partial}{\partial x_i}f (g(\mathbf{x}_0)) = f'(g(\mathbf{x}_0)) \frac{\partial}{\partial x_i} g(\mathbf{x}_0). \]

Collecting the partial derivatives in a vector gives the claim. \(\square\)

Here is a different generalization of the Chain Rule. Again the composition \(f \circ \mathbf{g}\) denotes the function \(f \circ \mathbf{g}(t) = f (\mathbf{g}(t))\).

THEOREM (Chain Rule) Let \(f : D_1 \to \mathbb{R}\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^d\), where \(D_2 \subseteq \mathbb{R}\). Assume that \(f\) is continuously differentiable at \(\mathbf{g}(t_0)\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(t_0\), an interior point of \(D_2\). Then

\[ (f\circ \mathbf{g})'(t_0) = \nabla f (\mathbf{g}(t_0))^T \mathbf{g}'(t_0). \]

\(\sharp\)

Proof: To simplify the notation, suppose that \(f\) is a real-valued function of \(\mathbf{x} = (x_1, \ldots, x_d)\) whose components are themselves functions of \(t \in \mathbb{R}\). Assume \(f\) is continuously differentiable at \(\mathbf{x}(t)\). To compute the total derivative \(\frac{\mathrm{d} f(t)}{\mathrm{d} t}\), let \(\Delta x_k = x_k(t + \Delta t) - x_k(t)\), \(x_k = x_k(t)\) and

\[ \Delta f = f(x_1 + \Delta x_1, \ldots, x_d + \Delta x_d) - f(x_1, \ldots, x_d). \]

We seek to compute the limit \(\lim_{\Delta t \to 0} \frac{\Delta f}{\Delta t}\). To relate this limit to partial derivatives of \(f\), we re-write \(\Delta f\) as a telescoping sum where each term involves variation of a single variable \(x_k\). That is,

\[\begin{align*} \Delta f = & [f(x_1 + \Delta x_1, \ldots, x_d + \Delta x_d) - f(x_1, x_2 + \Delta x_2, \ldots, x_d + \Delta x_d)]\\ &+ [f(x_1, x_2 + \Delta x_2, \ldots, x_d + \Delta x_d) - f(x_1, x_2, x_3 + \Delta x_3, \ldots, x_d + \Delta x_d)] \\ &+ \cdots + [f(x_1, \cdots, x_{d-1}, x_d + \Delta x_d) - f(x_1, \ldots, x_d)]. \end{align*}\]

Applying the Mean Value Theorem to each term gives

\[\begin{align*} \Delta f = & \Delta x_1 \frac{\partial f(x_1 + \theta_1 \Delta x_1, x_2 + \Delta x_2, \ldots, x_d + \Delta x_d)} {\partial x_1}\\ &+ \Delta x_2 \frac{\partial f(x_1, x_2 + \theta_2 \Delta x_2, x_3 + \Delta x_3, \ldots, x_d + \Delta x_d)} {\partial x_2}\\ &+ \cdots + \Delta x_d \frac{\partial f(x_1, \cdots, x_{d-1}, x_d + \theta_d \Delta x_d)} {\partial x_d} \end{align*}\]

where \(0 < \theta_k < 1\) for \(k=1,\ldots,d\). Dividing by \(\Delta t\), taking the limit \(\Delta t \to 0\) and using the fact that \(f\) is continuously differentiable, we get

\[ \frac{\mathrm{d} f (t)}{\mathrm{d} t} = \sum_{k=1}^d \frac{\partial f(\mathbf{x}(t))} {\partial x_k} \frac{\mathrm{d} x_k(t)}{\mathrm{d} t}. \]

\(\square\)
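This version of the Chain Rule is easy to test numerically: differentiate \(f \circ \mathbf{g}\) by finite differences and compare with \(\nabla f(\mathbf{g}(t_0))^T \mathbf{g}'(t_0)\). A sketch with an illustrative choice of \(f\) and curve \(\mathbf{g}\):

```python
import numpy as np

# f(x1, x2) = x1 * x2^2, with gradient (x2^2, 2 x1 x2)
f = lambda x: x[0] * x[1]**2
grad_f = lambda x: np.array([x[1]**2, 2 * x[0] * x[1]])

# Curve g(t) = (cos t, sin t), with g'(t) = (-sin t, cos t)
g = lambda t: np.array([np.cos(t), np.sin(t)])
gprime = lambda t: np.array([-np.sin(t), np.cos(t)])

t0 = 0.7
h = 1e-6
lhs = (f(g(t0 + h)) - f(g(t0 - h))) / (2 * h)   # (f ∘ g)'(t0), numerically
rhs = grad_f(g(t0)) @ gprime(t0)                # ∇f(g(t0))^T g'(t0)
print(np.isclose(lhs, rhs, atol=1e-5))  # True
```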

As a first application of the Chain Rule, we generalize the Mean Value Theorem to the case of several variables. We will use this result later to prove a multivariable Taylor expansion result that will play a central role in this chapter.

THEOREM (Mean Value) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\). Let \(\mathbf{x}_0 \in D\) and \(\delta > 0\) be such that \(B_\delta(\mathbf{x}_0) \subseteq D\). If \(f\) is continuously differentiable on \(B_\delta(\mathbf{x}_0)\), then for any \(\mathbf{x} \in B_\delta(\mathbf{x}_0)\)

\[ f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p} \]

for some \(\xi \in (0,1)\), where \(\mathbf{p} = \mathbf{x} - \mathbf{x}_0\). \(\sharp\)

One way to think of the Mean Value Theorem is as a \(0\)-th order Taylor expansion. It says that, when \(\mathbf{x}\) is close to \(\mathbf{x}_0\), the value \(f(\mathbf{x})\) is close to \(f(\mathbf{x}_0)\) in a way that can be controlled in terms of the gradient in the neighborhood of \(\mathbf{x}_0\). From this point of view, the term \(\nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p}\) is called the Lagrange remainder.

Proof idea: We apply the single-variable result and the Chain Rule.

Proof: Let \(\phi(t) = f(\boldsymbol{\alpha}(t))\) where \(\boldsymbol{\alpha}(t) = \mathbf{x}_0 + t \mathbf{p}\). Observe that \(\phi(0) = f(\mathbf{x}_0)\) and \(\phi(1) = f(\mathbf{x})\). By the Chain Rule and the Parametric Line Example,

\[ \phi'(t) = \nabla f(\boldsymbol{\alpha}(t))^T \boldsymbol{\alpha}'(t) = \nabla f(\boldsymbol{\alpha}(t))^T \mathbf{p} = \nabla f(\mathbf{x}_0 + t \mathbf{p})^T \mathbf{p}. \]

In particular, \(\phi\) has a continuous first derivative on \([0,1]\). By the Mean Value Theorem in the single-variable case

\[ \phi(t) = \phi(0) + t \phi'(\xi) \]

for some \(\xi \in (0,t)\). Plugging in the expressions for \(\phi(0)\) and \(\phi'(\xi)\) and taking \(t=1\) gives the claim. \(\square\)
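The theorem asserts the existence of some \(\xi \in (0,1)\); it does not say how to find it. As a sketch (illustrative function, crude grid search), one can locate such a \(\xi\) numerically by scanning the residual \(f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p} - f(\mathbf{x})\) for a sign change:

```python
import numpy as np

f = lambda x: x[0]**2 + x[0] * x[1]               # a smooth function on R^2
grad_f = lambda x: np.array([2 * x[0] + x[1], x[0]])

x0 = np.array([0.5, 1.0])
x  = np.array([1.5, 2.0])
p  = x - x0

# phi(xi) should vanish at the xi promised by the Mean Value Theorem.
phi = lambda xi: f(x0) + grad_f(x0 + xi * p) @ p - f(x)
xis = np.linspace(0.0, 1.0, 100001)
vals = np.array([phi(xi) for xi in xis])
xi_star = xis[np.argmin(np.abs(vals))]
print(xi_star, abs(phi(xi_star)) < 1e-3)  # a root exists in (0, 1)
```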

6.2.2. Second-order derivatives#

One can also define higher-order derivatives. We start with the single-variable case, where \(f : D \to \mathbb{R}\) with \(D \subseteq \mathbb{R}\) and \(x_0 \in D\) is an interior point of \(D\). Note that, if \(f'\) exists in \(D\), then it is itself a function of \(x\). Then the second derivative at \(x_0\) is

\[ f''(x_0) = \frac{\mathrm{d}^2 f(x_0)}{\mathrm{d} x^2} = \lim_{h \to 0} \frac{f'(x_0 + h) - f'(x_0)}{h} \]

provided the limit exists.

In the several variable case, we have the following:

DEFINITION (Second Partial Derivatives and Hessian) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(f\) is continuously differentiable in an open ball around \(\mathbf{x}_0\). Then \(\partial f(\mathbf{x})/\partial x_i\) is itself a function of \(\mathbf{x}\) and its partial derivative with respect to \(x_j\), if it exists, is denoted by

\[ \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} = \lim_{h \to 0} \frac{\frac{\partial f}{\partial x_i}(\mathbf{x}_0 + h \mathbf{e}_j) - \frac{\partial f}{\partial x_i}(\mathbf{x}_0)}{h}. \]

To simplify the notation, we write this as \(\partial^2 f(\mathbf{x}_0)/\partial x_i^2\) when \(j = i\). If \(\partial^2 f(\mathbf{x})/\partial x_j \partial x_i\) and \(\partial^2 f(\mathbf{x})/\partial x_i^2\) exist and are continuous in an open ball around \(\mathbf{x}_0\) for all \(i, j\), we say that \(f\) is twice continuously differentiable at \(\mathbf{x}_0\).

The matrix of second derivatives is called the Hessian and is denoted by

\[\begin{split} \mathbf{H}_f(\mathbf{x}_0) = \begin{pmatrix} \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d \partial x_1}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1 \partial x_d} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d^2} \end{pmatrix}. \end{split}\]

\(\natural\)

Like \(f\) and the gradient \(\nabla f\), the Hessian \(\mathbf{H}_f\) is a function of \(\mathbf{x}\). Unlike them, however, it is matrix-valued.

When \(f\) is twice continuously differentiable at \(\mathbf{x}_0\), its Hessian is a symmetric matrix.

THEOREM (Symmetry of the Hessian) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(f\) is twice continuously differentiable at \(\mathbf{x}_0\). Then for all \(i \neq j\)

\[ \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} = \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_i \partial x_j}. \]

\(\sharp\)

Proof idea: Two applications of the Mean Value Theorem show that the limits can be interchanged.

Proof: By definition of the partial derivative,

\[\begin{align*} \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} &= \lim_{h_j \to 0} \frac{\frac{\partial f}{\partial x_i}(\mathbf{x}_0 + h_j \mathbf{e}_j) - \frac{\partial f}{\partial x_i}(\mathbf{x}_0)}{h_j}\\ &= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_j h_i} \left\{ [f(\mathbf{x}_0 + h_j \mathbf{e}_j + h_i \mathbf{e}_i) - f(\mathbf{x}_0 + h_j \mathbf{e}_j)] - [f(\mathbf{x}_0 + h_i \mathbf{e}_i) - f(\mathbf{x}_0)] \right\}\\ &= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{ \frac{[f(\mathbf{x}_0 + h_i \mathbf{e}_i + h_j \mathbf{e}_j) - f(\mathbf{x}_0 + h_i \mathbf{e}_i)] - [f(\mathbf{x}_0 + h_j \mathbf{e}_j) - f(\mathbf{x}_0)]}{h_j} \right\}\\ &= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{\partial}{\partial x_j}[f(\mathbf{x}_0 + h_i \mathbf{e}_i + \theta_j h_j \mathbf{e}_j) - f(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j)] \right\}\\ &= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{\partial f}{\partial x_j}(\mathbf{x}_0 + h_i \mathbf{e}_i + \theta_j h_j \mathbf{e}_j) - \frac{\partial f}{\partial x_j}(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j) \right\} \end{align*}\]

for some \(\theta_j \in (0,1)\). Note that, on the third line, we rearranged the terms and, on the fourth line, we applied the Mean Value Theorem to \(f(\mathbf{x}_0 + h_i \mathbf{e}_i + h_j \mathbf{e}_j) - f(\mathbf{x}_0 + h_j \mathbf{e}_j)\) as a continuously differentiable function of \(h_j\).

Because \(\partial f/\partial x_j\) is continuously differentiable in an open ball around \(\mathbf{x}_0\), a second application of the Mean Value Theorem gives for some \(\theta_i \in (0,1)\)

\[\begin{align*} &\lim_{h_j \to 0} \lim_{h_i \to 0} \frac{1}{h_i} \left\{\frac{\partial f}{\partial x_j}(\mathbf{x}_0 + h_i \mathbf{e}_i + \theta_j h_j \mathbf{e}_j) - \frac{\partial f}{\partial x_j}(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j) \right\}\\ &= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{\partial}{\partial x_i}\left[\frac{\partial f}{\partial x_j}(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j + \theta_i h_i \mathbf{e}_i)\right]\\ &= \lim_{h_j \to 0} \lim_{h_i \to 0} \frac{\partial^2 f(\mathbf{x}_0 + \theta_j h_j \mathbf{e}_j + \theta_i h_i \mathbf{e}_i)}{\partial x_i \partial x_j}. \end{align*}\]

The claim then follows from the continuity of \(\partial^2 f/\partial x_i \partial x_j\). \(\square\)

EXAMPLE: Consider the quadratic function

\[ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r. \]

Recall that the gradient of \(f\) is

\[ \nabla f(\mathbf{x}) = \frac{1}{2}[P + P^T] \,\mathbf{x} + \mathbf{q}. \]

To simplify the calculation, let \(B = \frac{1}{2}[P + P^T]\) and denote the rows of \(B\) by \(\mathbf{b}_1^T, \ldots,\mathbf{b}_d^T\).

Each component of \(\nabla f\) is an affine function of \(\mathbf{x}\), specifically,

\[ \frac{\partial f (\mathbf{x})}{\partial x_i} = \mathbf{b}_i^T \mathbf{x} + q_i. \]

Row \(i\) of the Hessian is simply the transpose of the gradient of \(\frac{\partial f (\mathbf{x})}{\partial x_i}\) which, by our previous results, is

\[ \left(\nabla \frac{\partial f (\mathbf{x})}{\partial x_i}\right)^T = \mathbf{b}_i^T. \]

Putting this together we get

\[ \mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T]. \]

Observe that this is indeed a symmetric matrix. \(\lhd\)
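Both the formula \(\mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T]\) and the symmetry of the Hessian can be checked numerically by differentiating the gradient one coordinate at a time. A sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
P = rng.standard_normal((d, d))
q = rng.standard_normal(d)
grad = lambda x: 0.5 * (P + P.T) @ x + q   # gradient of the quadratic

# The Hessian is the Jacobian of the gradient; approximate it column
# by column with central finite differences.
x = rng.standard_normal(d)
h = 1e-6
H = np.column_stack([(grad(x + h * e) - grad(x - h * e)) / (2 * h)
                     for e in np.eye(d)])
print(np.allclose(H, 0.5 * (P + P.T), atol=1e-4))  # True
print(np.allclose(H, H.T, atol=1e-4))              # True: symmetric
```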

6.2.3. Jacobian and chain rule#

Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable.

DEFINITION (Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}\) and let \(x_0 \in D\) be an interior point of \(D\). The derivative of \(f\) at \(x_0\) is

\[ f'(x_0) = \frac{\mathrm{d} f (x_0)}{\mathrm{d} x} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} \]

provided the limit exists. \(\natural\)

A different way to put this is that \(f'(x)\) is the slope of the tangent line to \(f\) at \(x\). Formally, one can approximate \(f(x)\) by a linear function in the neighborhood of \(x\) as follows

\[ f(x + h) = f(x) + f'(x) h + r(h), \]

where \(r(h)\) is negligible compared to \(h\) in the sense that

\[ \lim_{h\to 0} \frac{r(h)}{h} = 0. \]

Indeed, define

\[ r(h) = f(x + h) - f(x) - f'(x) h. \]

Then by definition of the derivative

\[ \lim_{h\to 0} \frac{r(h)}{h} = \lim_{h\to 0} \frac{f(x + h) - f(x) - f'(x) h}{h} = \lim_{h\to 0} \left[\frac{f(x + h) - f(x)}{h} - f'(x) \right] = 0. \]

For vector-valued functions, we have the following generalization. Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x} \in D\) be an interior point of \(D\). We say that \(\mathbf{f}\) is differentiable at \(\mathbf{x}\) if there exists a matrix \(A \in \mathbb{R}^{m \times d}\) such that

\[ \mathbf{f}(\mathbf{x}+\mathbf{h}) = \mathbf{f}(\mathbf{x}) + A \mathbf{h} + \mathbf{r}(\mathbf{h}) \]

where

\[ \lim_{\mathbf{h} \to 0} \frac{\|\mathbf{r}(\mathbf{h})\|_2}{\|\mathbf{h}\|_2} = 0. \]

The matrix \(\mathbf{f}'(\mathbf{x}) = A\) is called the differential of \(\mathbf{f}\) at \(\mathbf{x}\), and we see that the affine map \(\mathbf{f}(\mathbf{x}) + A \mathbf{h}\) provides an approximation of \(\mathbf{f}\) in the neighborhood of \(\mathbf{x}\).

We will not derive the full theory here. In the case where each component of \(\mathbf{f}\) has continuous partial derivatives in a neighborhood of \(\mathbf{x}\), then the differential exists and is equal to the Jacobian, as defined next.

DEFINITION (Jacobian) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\) where \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists for all \(i, j\). The Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\) is the \(m \times d\) matrix

\[\begin{split} J_{\mathbf{f}}(\mathbf{x}_0) = \begin{pmatrix} \frac{\partial f_1 (\mathbf{x}_0)}{\partial x_1} & \ldots & \frac{\partial f_1 (\mathbf{x}_0)}{\partial x_d}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m (\mathbf{x}_0)}{\partial x_1} & \ldots & \frac{\partial f_m (\mathbf{x}_0)}{\partial x_d} \end{pmatrix}. \end{split}\]

\(\natural\)

THEOREM (Differential and Jacobian) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i, j\). Then the differential at \(\mathbf{x}_0\) is equal to the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\). \(\sharp\)

Recall that for any \(A, B\) for which \(AB\) is well-defined it holds that

\[ \|A B \|_F \leq \|A\|_F \|B\|_F. \]

This applies in particular when \(B\) is a column vector, in which case \(\|B\|_F\) is its Euclidean norm.

Proof: By the Mean Value Theorem, for each \(i\), there is \(\xi_{\mathbf{h},i} \in (0,1)\) such that

\[ f_i(\mathbf{x}_0+\mathbf{h}) = f_i(\mathbf{x}_0) + \nabla f_i(\mathbf{x}_0 + \xi_{\mathbf{h},i} \mathbf{h})^T \mathbf{h}. \]

Define

\[\begin{split} \tilde{J}(\mathbf{h}) = \begin{pmatrix} \frac{\partial f_1 (\mathbf{x}_0 + \xi_{\mathbf{h},1} \mathbf{h})}{\partial x_1} & \ldots & \frac{\partial f_1 (\mathbf{x}_0 + \xi_{\mathbf{h},1} \mathbf{h})}{\partial x_d}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m (\mathbf{x}_0 + \xi_{\mathbf{h},m} \mathbf{h})}{\partial x_1} & \ldots & \frac{\partial f_m (\mathbf{x}_0 + \xi_{\mathbf{h},m} \mathbf{h})}{\partial x_d} \end{pmatrix}. \end{split}\]

Hence we have

\[ \mathbf{f}(\mathbf{x}_0+\mathbf{h}) - \mathbf{f}(\mathbf{x}_0) - J_{\mathbf{f}}(\mathbf{x}_0)\,\mathbf{h} = \tilde{J}(\mathbf{h}) \,\mathbf{h} - J_{\mathbf{f}}(\mathbf{x}_0)\,\mathbf{h} = \left(\tilde{J}(\mathbf{h}) - J_{\mathbf{f}}(\mathbf{x}_0)\right)\,\mathbf{h}. \]

Taking a limit as \(\mathbf{h}\) goes to \(0\), we get

\[\begin{align*} \lim_{\mathbf{h} \to 0} \frac{\|\mathbf{f}(\mathbf{x}_0+\mathbf{h}) - \mathbf{f}(\mathbf{x}_0) - J_{\mathbf{f}}(\mathbf{x}_0)\,\mathbf{h}\|_2}{\|\mathbf{h}\|_2} &= \lim_{\mathbf{h} \to 0} \frac{\|\left(\tilde{J}(\mathbf{h}) - J_{\mathbf{f}}(\mathbf{x}_0)\right)\,\mathbf{h}\|_2}{\|\mathbf{h}\|_2}\\ &\leq \lim_{\mathbf{h} \to 0} \frac{\|\tilde{J}(\mathbf{h}) - J_{\mathbf{f}}(\mathbf{x}_0)\|_F \|\mathbf{h}\|_2}{\|\mathbf{h}\|_2}\\ &= 0, \end{align*}\]

by continuity of the partial derivatives. \(\square\)

EXAMPLE: An example of a vector-valued function is

\[\begin{split} \mathbf{g}(x_1,x_2) = \begin{pmatrix} g_1(x_1,x_2)\\ g_2(x_1,x_2)\\ g_3(x_1,x_2) \end{pmatrix} = \begin{pmatrix} 3 x_1^2\\ x_2\\ x_1 x_2 \end{pmatrix}. \end{split}\]

Its Jacobian is

\[\begin{split} J_{\mathbf{g}}(x_1, x_2) = \begin{pmatrix} \frac{\partial g_1 (x_1, x_2)}{\partial x_1} & \frac{\partial g_1 (x_1, x_2)}{\partial x_2}\\ \frac{\partial g_2 (x_1, x_2)}{\partial x_1} & \frac{\partial g_2 (x_1, x_2)}{\partial x_2}\\ \frac{\partial g_3 (x_1, x_2)}{\partial x_1} & \frac{\partial g_3 (x_1, x_2)}{\partial x_2} \end{pmatrix} = \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}. \end{split}\]

\(\lhd\)
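The entries of this Jacobian can be checked by finite differences, column by column. A minimal sketch, assuming NumPy (the names `g` and `J` are illustrative):

```python
import numpy as np

g = lambda x: np.array([3 * x[0]**2, x[1], x[0] * x[1]])
J = lambda x: np.array([[6 * x[0], 0.0],
                        [0.0,      1.0],
                        [x[1],     x[0]]])

# Column j of the Jacobian is the derivative of g along e_j.
x = np.array([1.0, 2.0])
h = 1e-6
J_fd = np.column_stack([(g(x + h * e) - g(x - h * e)) / (2 * h)
                        for e in np.eye(2)])
print(np.allclose(J_fd, J(x), atol=1e-5))  # True
```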

EXAMPLE: (Gradient and Jacobian) For a continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian reduces to the row vector

\[ J_{f}(\mathbf{x}_0) = \left(\frac{\partial f (\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f (\mathbf{x}_0)}{\partial x_d}\right)^T = \nabla f(\mathbf{x}_0)^T \]

where \(\nabla f(\mathbf{x}_0)\) is the gradient of \(f\) at \(\mathbf{x}_0\). \(\lhd\)

EXAMPLE: (Hessian and Jacobian) For a twice continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian of its gradient is

\[\begin{split} J_{\nabla f}(\mathbf{x}_0) = \begin{pmatrix} \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d \partial x_1}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1 \partial x_d} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d^2} \end{pmatrix}, \end{split}\]

that is, the Hessian of \(f\) at \(\mathbf{x}_0\). \(\lhd\)

EXAMPLE: (Parametric Curve and Jacobian) Consider the parametric curve \(\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d\) for \(t\) in some closed interval of \(\mathbb{R}\). Assume that \(\mathbf{g}(t)\) is continuously differentiable at \(t = t_0\), that is, each of its components is.

Then

\[\begin{split} J_{\mathbf{g}}(t_0) = \begin{pmatrix} g_1'(t_0)\\ \vdots\\ g_d'(t_0) \end{pmatrix} = \mathbf{g}'(t_0). \end{split}\]

\(\lhd\)

EXAMPLE: (Affine Map) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1,\ldots,b_m)^T \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_m) : \mathbb{R}^d \to \mathbb{R}^m\) as

\[ \mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}. \]

This is an affine map. Note in particular that, in the case \(\mathbf{b} = \mathbf{0}\) of a linear map,

\[ \mathbf{f}(\mathbf{x} + \mathbf{y}) = A(\mathbf{x} + \mathbf{y}) = A\mathbf{x} + A\mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{f}(\mathbf{y}). \]

Denote the rows of \(A\) by \(\mathbf{a}_1^T,\ldots,\mathbf{a}_m^T\).

We compute the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}\). Note that

\[\begin{align*} \frac{\partial f_i (\mathbf{x})}{\partial x_j} &= \frac{\partial}{\partial x_j}[\mathbf{a}_i^T \mathbf{x} + b_i]\\ &= \frac{\partial}{\partial x_j}\left[\sum_{\ell=1}^d a_{i,\ell} x_{\ell} + b_i\right]\\ &= a_{i,j}. \end{align*}\]

So

\[ J_{\mathbf{f}}(\mathbf{x}) = A. \]

\(\lhd\)

Functions are often obtained from the composition of simpler ones. We will use the vector notation \(\mathbf{h} = \mathbf{g} \circ \mathbf{f}\) for the function \(\mathbf{h}(\mathbf{x}) = \mathbf{g} (\mathbf{f} (\mathbf{x}))\).

LEMMA (Composition of Continuous Functions) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuous at \(\mathbf{x}_0\) and that \(\mathbf{g}\) is continuous at \(\mathbf{f}(\mathbf{x}_0)\). Then \(\mathbf{g} \circ \mathbf{f}\) is continuous at \(\mathbf{x}_0\). \(\flat\)

The Chain Rule gives a formula for the Jacobian of a composition.

THEOREM (Chain Rule) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Then

\[ J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \,J_{\mathbf{f}}(\mathbf{x}_0) \]

as a product of matrices. \(\sharp\)

Intuitively, the Jacobian provides a linear approximation of the function in the neighborhood of a point. The composition of linear maps corresponds to the product of the associated matrices. Similarly, the Jacobian of a composition is the product of the Jacobians.

Proof: By the Chain Rule for a multivariable function over a parametric curve, we have that

\[ \frac{\partial h_i(\mathbf{x}_0)}{\partial x_j} = \sum_{k=1}^m \frac{\partial g_i(\mathbf{f}(\mathbf{x}_0))} {\partial f_k} \frac{\partial f_k(\mathbf{x}_0)}{\partial x_j} \]

where the notation \(\frac{\partial g_i} {\partial f_k}\) indicates the partial derivative of \(g_i\) with respect to its \(k\)-th argument. In matrix form, the claim follows. \(\square\)

EXAMPLE: (Affine Map continued) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). Define again the vector-valued function \(\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). In addition, let \(C \in \mathbb{R}^{p \times m}\) and \(\mathbf{d} \in \mathbb{R}^{p}\), define \(\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^p\) as \(\mathbf{g}(\mathbf{y}) = C \mathbf{y} + \mathbf{d}\).

Then

\[ J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x})) \,J_{\mathbf{f}}(\mathbf{x}) = C A, \]

for all \(\mathbf{x} \in \mathbb{R}^d\).

This is consistent with the observation that

\[ \mathbf{g} \circ \mathbf{f} (\mathbf{x}) = \mathbf{g} (\mathbf{f} (\mathbf{x})) = C( A\mathbf{x} + \mathbf{b} ) + \mathbf{d} = CA \mathbf{x} + (C\mathbf{b} + \mathbf{d}). \]

\(\lhd\)

EXAMPLE: Suppose we want to compute the gradient of the function

\[ f(x_1, x_2) = 3 x_1^2 + x_2 + \exp(x_1 x_2). \]

We could apply the Chain Rule directly, but to illustrate the perspective that is coming up in a subsequent section, we think of \(f\) as a composition of “simpler” vector-valued functions. Specifically, let

\[\begin{split} \mathbf{g}(x_1,x_2) = \begin{pmatrix} 3 x_1^2\\ x_2\\ x_1 x_2 \end{pmatrix} \qquad h(y_1,y_2,y_3) = y_1 + y_2 + \exp(y_3). \end{split}\]

Then \(f(x_1, x_2) = h(\mathbf{g}(x_1, x_2)) = h \circ \mathbf{g}(x_1, x_2)\).

By the Chain Rule, we can compute the gradient of \(f\) by first computing the Jacobians of \(\mathbf{g}\) and \(h\). We have already computed the Jacobian of \(\mathbf{g}\)

\[\begin{split} J_{\mathbf{g}}(x_1, x_2) = \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}. \end{split}\]

The Jacobian of \(h\) is

\[ J_h(y_1, y_2, y_3) = \begin{pmatrix} \frac{\partial h(y_1, y_2, y_3)}{\partial y_1} & \frac{\partial h(y_1, y_2, y_3)}{\partial y_2} & \frac{\partial h(y_1, y_2, y_3)}{\partial y_3} \end{pmatrix} = \begin{pmatrix} 1 & 1 & \exp(y_3) \end{pmatrix}. \]

The Chain Rule stipulates that

\[\begin{align*} \nabla f(x_1, x_2)^T &= J_f(x_1, x_2)\\ &= J_h(\mathbf{g}(x_1,x_2)) \, J_{\mathbf{g}}(x_1, x_2)\\ &= \begin{pmatrix} 1 & 1 & \exp(g_3(x_1, x_2)) \end{pmatrix} \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}\\ &= \begin{pmatrix} 1 & 1 & \exp(x_1 x_2) \end{pmatrix} \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}\\ &= \begin{pmatrix} 6 x_1 + x_2 \exp(x_1 x_2) & 1 + x_1 \exp(x_1 x_2) \end{pmatrix}. \end{align*}\]

You can check directly that this is indeed the correct gradient (transposed). \(\lhd\)
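One way to do that check (a sketch, assuming NumPy) is to compare the closed-form gradient against central finite differences; at \((x_1, x_2) = (1, 2)\) both give approximately \((20.778, 8.389)\):

```python
import numpy as np

f = lambda x: 3 * x[0]**2 + x[1] + np.exp(x[0] * x[1])
grad_formula = lambda x: np.array([6 * x[0] + x[1] * np.exp(x[0] * x[1]),
                                   1 + x[0] * np.exp(x[0] * x[1])])

x = np.array([1.0, 2.0])
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(2)])
print(grad_formula(x))                                  # ~ [20.778, 8.389]
print(np.allclose(grad_fd, grad_formula(x), atol=1e-4)) # True
```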

6.2.4. Brief introduction to automatic differentiation#

We illustrate the use of automatic differentiation to compute gradients.

Quoting Wikipedia:

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.

We will use PyTorch. It uses tensors, which in many ways behave similarly to NumPy arrays. See here for a quick introduction. Here is an example. We first initialize the tensors. Here each tensor corresponds to a single real variable. With the option requires_grad=True, we indicate that these are variables with respect to which a gradient will be taken later. We initialize the tensors at the values where the derivatives will be computed. If derivatives need to be computed at different values, we need to repeat this process.

import torch

# Initialize variables
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

The function .backward() computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with .grad. We first define the function. Note that we use torch.exp, the PyTorch implementation of the (element-wise) exponential function. Moreover, as in NumPy, PyTorch allows the use of ** for taking a power. Here is a list of operations on tensors in PyTorch.

# Define function
f = 3 * (x1 ** 2) + x2 + torch.exp(x1 * x2)
# Perform automatic differentiation
f.backward()  # Compute gradients
# Print gradients
print(x1.grad)  # df/dx1
print(x2.grad)  # df/dx2
tensor(20.7781)
tensor(8.3891)

The input parameters can also be vectors, which allows us to consider functions of many variables. Here we use torch.sum to sum the entries of a tensor.

# New variables for the second example
z = torch.tensor([1., 2., 3.], requires_grad=True)
# Perform automatic differentiation
g = torch.sum(z ** 2)
g.backward()  # Compute gradients
# Print gradient
print(z.grad)  # gradient is (2 z_1, 2 z_2, 2 z_3)
tensor([2., 4., 6.])

Here is another typical example in a data science context.

# Variables for the third example
X = torch.randn(3, 2)  # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]])  # Dataset (labels), shape (3, 1) to match the predictions
theta = torch.ones(2, 1, requires_grad=True)  # Parameter assignment
# Perform automatic differentiation
predict = X @ theta  # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2)  # Loss function
loss.backward()  # Compute gradients
# Print gradient
print(theta.grad)  # gradient of loss
tensor([[0.3187],
        [2.6270]])

LEARNING BY CHATTING: Ask your favorite AI chatbot to explain how to compute a second derivative using PyTorch (it’s a bit tricky). Ask for code that you can apply to the previous examples.
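As a starting point for that conversation, here is one possible approach (a sketch, not the only way): torch.autograd.grad with create_graph=True keeps the first derivative itself differentiable, so it can be differentiated a second time.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
f = x ** 3                                            # f(x) = x^3
# create_graph=True builds a graph for the gradient itself,
# so the gradient can in turn be differentiated.
(df,) = torch.autograd.grad(f, x, create_graph=True)  # df/dx = 3 x^2
(d2f,) = torch.autograd.grad(df, x)                   # d^2f/dx^2 = 6 x
print(df.item(), d2f.item())  # 3.0 6.0
```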