6.2. Background: review of differentiable functions of several variables and introduction to automatic differentiation#
We review the differential calculus of several variables. We highlight a few key results that will play an important role: the Chain Rule and the Mean Value Theorem. We also give a brief introduction to automatic differentiation.
6.2.1. Gradient#
Recall that a point \(\mathbf{x} \in \mathbb{R}^d\) is an interior point of a set \(A \subseteq \mathbb{R}^d\) if there exists an \(r > 0\) such that \(B_r(\mathbf{x}) \subseteq A\), where the open \(r\)-ball around \(\mathbf{x} \in \mathbb{R}^d\) is the set of points within Euclidean distance \(r\) of \(\mathbf{x}\), that is,

\[
B_r(\mathbf{x}) = \{ \mathbf{y} \in \mathbb{R}^d \,:\, \|\mathbf{y} - \mathbf{x}\| < r \}.
\]
Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable.
DEFINITION (Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}\) and let \(x_0 \in D\) be an interior point of \(D\). The derivative of \(f\) at \(x_0\) is

\[
f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h},
\]
provided the limit exists. \(\natural\)
For functions of several variables, we have the following generalization. We let \(\mathbf{e}_i \in \mathbb{R}^d\) be the \(i\)-th standard basis vector.
DEFINITION (Partial Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 = (x_{0,1},\ldots,x_{0,d}) \in D\) be an interior point of \(D\). The partial derivative of \(f\) at \(\mathbf{x}_0\) with respect to \(x_i\) is

\[
\frac{\partial f (\mathbf{x}_0)}{\partial x_i} = \lim_{h \to 0} \frac{f(\mathbf{x}_0 + h \mathbf{e}_i) - f(\mathbf{x}_0)}{h},
\]
provided the limit exists. If \(\frac{\partial f (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i\), then we say that \(f\) is continuously differentiable at \(\mathbf{x}_0\). \(\natural\)
DEFINITION (Gradient) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume \(f\) is continuously differentiable at \(\mathbf{x}_0\). The (column) vector

\[
\nabla f(\mathbf{x}_0) = \left( \frac{\partial f (\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f (\mathbf{x}_0)}{\partial x_d} \right)^T
\]
is called the gradient of \(f\) at \(\mathbf{x}_0\). \(\natural\)
Note that the gradient is itself a function of \(\mathbf{x}\). In fact, unlike \(f\), it is a vector-valued function.
Figure: Gradient as a function (Source)
\(\bowtie\)
EXAMPLE: Consider the affine function

\[
f(\mathbf{x}) = \mathbf{q}^T \mathbf{x} + r = \sum_{j=1}^d q_j x_j + r,
\]

where \(\mathbf{x} = (x_1, \ldots, x_d)^T, \mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d\) and \(r \in \mathbb{R}\). The partial derivatives of the linear term are given by

\[
\frac{\partial}{\partial x_i} \sum_{j=1}^d q_j x_j = q_i.
\]

So the gradient of \(f\) is

\[
\nabla f(\mathbf{x}) = \mathbf{q}.
\]
\(\lhd\)
EXAMPLE: Consider the quadratic function

\[
f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \,\mathbf{x} + \mathbf{q}^T \mathbf{x} + r,
\]

where \(\mathbf{x} = (x_1, \ldots, x_d)^T, \mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d\), \(P \in \mathbb{R}^{d \times d}\) and \(r \in \mathbb{R}\). The partial derivatives of the quadratic term are given by

\[
\frac{\partial}{\partial x_i} \left[ \frac{1}{2} \sum_{j=1}^d \sum_{k=1}^d P_{jk} \, x_j x_k \right]
= \frac{1}{2} \sum_{k=1}^d P_{ik} \, x_k + \frac{1}{2} \sum_{j=1}^d P_{ji} \, x_j,
\]

where we used that all terms not including \(x_i\) have partial derivative \(0\).

This last expression is

\[
\frac{1}{2} \left[ (P \mathbf{x})_i + (P^T \mathbf{x})_i \right]
= \left( \frac{1}{2} [P + P^T] \,\mathbf{x} \right)_i.
\]

So the gradient of \(f\) is

\[
\nabla f(\mathbf{x}) = \frac{1}{2} [P + P^T] \,\mathbf{x} + \mathbf{q}.
\]

If \(P\) is symmetric, this further simplifies to \(\nabla f(\mathbf{x}) = P \,\mathbf{x} + \mathbf{q}\). \(\lhd\)
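The gradient of a quadratic can be checked numerically. Here is a minimal sketch using PyTorch's automatic differentiation (introduced later in this section), with made-up \(P\) and \(\mathbf{q}\); the constant \(r\) is omitted since it does not affect the gradient:

```python
import torch

# Check that the gradient of f(x) = (1/2) x^T P x + q^T x
# equals (1/2)(P + P^T) x + q. P, q, x below are made up.
torch.manual_seed(0)
d = 3
P = torch.randn(d, d)
q = torch.randn(d)
x = torch.randn(d, requires_grad=True)

f = 0.5 * x @ P @ x + q @ x
f.backward()

formula = 0.5 * (P + P.T) @ x.detach() + q
print(torch.allclose(x.grad, formula))  # True
```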
It will be useful to compute the derivative of a function \(f(\mathbf{x})\) of several variables along a parametric curve \(\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d\) for \(t\) in some closed interval of \(\mathbb{R}\). The following result is a special case of an important fact. We will use the following notation

\[
\mathbf{g}'(t) = (g_1'(t), \ldots, g_d'(t))^T.
\]
We say that \(\mathbf{g}(t)\) is continuously differentiable at \(t = t_0\) if each of its components is.
EXAMPLE: (Parametric Line) The straight line between \(\mathbf{x}_0 = (x_{0,1},\ldots,x_{0,d})^T\) and \(\mathbf{x}_1 = (x_{1,1},\ldots,x_{1,d})^T\) in \(\mathbb{R}^d\) can be parametrized as

\[
\mathbf{g}(t) = \mathbf{x}_0 + t \, (\mathbf{x}_1 - \mathbf{x}_0),
\]

where \(t\) goes from \(0\) (at which \(\mathbf{g}(0) = \mathbf{x}_0\)) to \(1\) (at which \(\mathbf{g}(1) = \mathbf{x}_1\)).

Then

\[
g_i'(t) = x_{1,i} - x_{0,i}, \quad i = 1,\ldots,d,
\]

so that

\[
\mathbf{g}'(t) = \mathbf{x}_1 - \mathbf{x}_0.
\]
\(\lhd\)
Before stating the Chain Rule, first recall the Mean Value Theorem in the single-variable case, which we will use in the proof.
THEOREM (Mean Value) Let \(f : [a,b] \to \mathbb{R}\) be a continuous function and assume that its derivative exists on \((a,b)\). Then there is \(a < c < b\) such that

\[
f(b) = f(a) + (b - a) \, f'(c).
\]
\(\sharp\)
Recall the Chain Rule in the single-variable case. Quoting Wikipedia:
The simplest form of the chain rule is for real-valued functions of one real variable. It states that if \(g\) is a function that is differentiable at a point \(c\) (i.e. the derivative \(g'(c)\) exists) and \(f\) is a function that is differentiable at \(g(c)\), then the composite function \({\displaystyle f\circ g}\) is differentiable at \(c\), and the derivative is \({\displaystyle (f\circ g)'(c)=f'(g(c))\cdot g'(c)}\).
Here is a straightforward generalization of the Chain Rule.
THEOREM (Chain Rule) Let \(f : D_1 \to \mathbb{R}\), where \(D_1 \subseteq \mathbb{R}\), and let \(g : D_2 \to \mathbb{R}\), where \(D_2 \subseteq \mathbb{R}^d\). Assume that \(f\) is continuously differentiable at \(g(\mathbf{x}_0)\), an interior point of \(D_1\), and that \(g\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_2\). Then

\[
\nabla (f \circ g)(\mathbf{x}_0) = f'(g(\mathbf{x}_0)) \, \nabla g(\mathbf{x}_0).
\]
\(\sharp\)
Proof: We apply the Chain Rule for functions of one variable to the partial derivatives. For all \(i\),

\[
\frac{\partial (f \circ g)(\mathbf{x}_0)}{\partial x_i} = f'(g(\mathbf{x}_0)) \, \frac{\partial g(\mathbf{x}_0)}{\partial x_i}.
\]
Collecting the partial derivatives in a vector gives the claim. \(\square\)
Here is a different generalization of the Chain Rule. Again the composition \(f \circ \mathbf{g}\) denotes the function \(f \circ \mathbf{g}(t) = f (\mathbf{g}(t))\).
THEOREM (Chain Rule) Let \(f : D_1 \to \mathbb{R}\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^d\), where \(D_2 \subseteq \mathbb{R}\). Assume that \(f\) is continuously differentiable at \(\mathbf{g}(t_0)\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(t_0\), an interior point of \(D_2\). Then

\[
\frac{\mathrm{d}}{\mathrm{d} t} f(\mathbf{g}(t_0)) = \nabla f(\mathbf{g}(t_0))^T \, \mathbf{g}'(t_0).
\]
\(\sharp\)
Proof: To simplify the notation, suppose that \(f\) is a real-valued function of \(\mathbf{x} = (x_1, \ldots, x_d)\) whose components are themselves functions of \(t \in \mathbb{R}\). Assume \(f\) is continuously differentiable at \(\mathbf{x}(t)\). To compute the total derivative \(\frac{\mathrm{d} f(t)}{\mathrm{d} t}\), let \(\Delta x_k = x_k(t + \Delta t) - x_k(t)\), \(x_k = x_k(t)\) and

\[
\Delta f = f(x_1 + \Delta x_1, \ldots, x_d + \Delta x_d) - f(x_1, \ldots, x_d).
\]
We seek to compute the limit \(\lim_{\Delta t \to 0} \frac{\Delta f}{\Delta t}\). To relate this limit to partial derivatives of \(f\), we re-write \(\Delta f\) as a telescoping sum where each term involves variation of a single variable \(x_k\). That is,

\[
\Delta f = \sum_{k=1}^d \left[ f(x_1 + \Delta x_1, \ldots, x_k + \Delta x_k, x_{k+1}, \ldots, x_d) - f(x_1 + \Delta x_1, \ldots, x_{k-1} + \Delta x_{k-1}, x_k, \ldots, x_d) \right].
\]
Applying the Mean Value Theorem to each term gives

\[
\Delta f = \sum_{k=1}^d \frac{\partial f}{\partial x_k}(x_1 + \Delta x_1, \ldots, x_{k-1} + \Delta x_{k-1}, x_k + \theta_k \Delta x_k, x_{k+1}, \ldots, x_d) \, \Delta x_k,
\]
where \(0 < \theta_k < 1\) for \(k=1,\ldots,d\). Dividing by \(\Delta t\), taking the limit \(\Delta t \to 0\) and using the fact that \(f\) is continuously differentiable, we get

\[
\frac{\mathrm{d} f}{\mathrm{d} t} = \sum_{k=1}^d \frac{\partial f(\mathbf{x}(t))}{\partial x_k} \, \frac{\mathrm{d} x_k(t)}{\mathrm{d} t}.
\]
\(\square\)
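This identity is easy to check numerically. Here is a minimal sketch using PyTorch (assumed available), with the made-up choices \(f(\mathbf{x}) = x_1^2 + x_2^2\) and \(\mathbf{g}(t) = (t, t^2)^T\), so that \(f(\mathbf{g}(t)) = t^2 + t^4\) and the theorem predicts \(\nabla f(\mathbf{g}(t))^T \mathbf{g}'(t) = 2t + 4t^3\):

```python
import torch

# d/dt f(g(t)) for f(x) = x_1^2 + x_2^2 and g(t) = (t, t^2),
# evaluated at the made-up point t = 1.5
t = torch.tensor(1.5, requires_grad=True)
g = torch.stack([t, t ** 2])   # the curve g(t)
f = (g ** 2).sum()             # f evaluated along the curve
f.backward()

predicted = 2 * 1.5 + 4 * 1.5 ** 3   # 2t + 4t^3 = 16.5
print(torch.isclose(t.grad, torch.tensor(predicted)))  # True
```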
As a first application of the Chain Rule, we generalize the Mean Value Theorem to the case of several variables. We will use this result later to prove a multivariable Taylor expansion result that will play a central role in this chapter.
THEOREM (Mean Value) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\). Let \(\mathbf{x}_0 \in D\) and \(\delta > 0\) be such that \(B_\delta(\mathbf{x}_0) \subseteq D\). If \(f\) is continuously differentiable on \(B_\delta(\mathbf{x}_0)\), then for any \(\mathbf{x} \in B_\delta(\mathbf{x}_0)\)

\[
f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p},
\]
for some \(\xi \in (0,1)\), where \(\mathbf{p} = \mathbf{x} - \mathbf{x}_0\). \(\sharp\)
One way to think of the Mean Value Theorem is as a \(0\)-th order Taylor expansion. It says that, when \(\mathbf{x}\) is close to \(\mathbf{x}_0\), the value \(f(\mathbf{x})\) is close to \(f(\mathbf{x}_0)\) in a way that can be controlled in terms of the gradient in the neighborhood of \(\mathbf{x}_0\). From this point of view, the term \(\nabla f(\mathbf{x}_0 + \xi \mathbf{p})^T \mathbf{p}\) is called the Lagrange remainder.
Proof idea: We apply the single-variable result and the Chain Rule.
Proof: Let \(\phi(t) = f(\boldsymbol{\alpha}(t))\) where \(\boldsymbol{\alpha}(t) = \mathbf{x}_0 + t \mathbf{p}\). Observe that \(\phi(0) = f(\mathbf{x}_0)\) and \(\phi(1) = f(\mathbf{x})\). By the Chain Rule and the Parametric Line Example,

\[
\phi'(t) = \nabla f(\boldsymbol{\alpha}(t))^T \, \boldsymbol{\alpha}'(t) = \nabla f(\mathbf{x}_0 + t \mathbf{p})^T \, \mathbf{p}.
\]
In particular, \(\phi\) has a continuous first derivative on \([0,1]\). By the Mean Value Theorem in the single-variable case

\[
\phi(t) = \phi(0) + t \, \phi'(\xi),
\]
for some \(\xi \in (0,t)\). Plugging in the expressions for \(\phi(0)\) and \(\phi'(\xi)\) and taking \(t=1\) gives the claim. \(\square\)
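For a concrete illustration, take the quadratic \(f(\mathbf{x}) = \|\mathbf{x}\|^2\), for which \(\nabla f(\mathbf{x}) = 2\mathbf{x}\) and the theorem holds with \(\xi = 1/2\) exactly. A minimal sketch (made-up points, PyTorch assumed available):

```python
import torch

# Mean Value Theorem check for f(x) = ||x||^2 with xi = 1/2:
# f(x) - f(x0) should equal grad f(x0 + p/2) . p, where p = x - x0
x0 = torch.tensor([1.0, -2.0])   # made-up point
x = torch.tensor([3.0, 0.5])     # made-up point
p = x - x0

lhs = (x ** 2).sum() - (x0 ** 2).sum()
grad_mid = 2 * (x0 + 0.5 * p)    # gradient at the midpoint x0 + p/2
print(torch.isclose(lhs, grad_mid @ p))  # True
```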
6.2.2. Second-order derivatives#
One can also define higher-order derivatives. We start with the single-variable case, where \(f : D \to \mathbb{R}\) with \(D \subseteq \mathbb{R}\) and \(x_0 \in D\) is an interior point of \(D\). Note that, if \(f'\) exists in \(D\), then it is itself a function of \(x\). Then the second derivative at \(x_0\) is

\[
f''(x_0) = \lim_{h \to 0} \frac{f'(x_0 + h) - f'(x_0)}{h},
\]
provided the limit exists.
In the several variable case, we have the following:
DEFINITION (Second Partial Derivatives and Hessian) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(f\) is continuously differentiable in an open ball around \(\mathbf{x}_0\). Then \(\partial f(\mathbf{x})/\partial x_i\) is itself a function of \(\mathbf{x}\) and its partial derivative with respect to \(x_j\), if it exists, is denoted by

\[
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} = \frac{\partial}{\partial x_j} \left[ \frac{\partial f}{\partial x_i} \right](\mathbf{x}_0).
\]
To simplify the notation, we write this as \(\partial^2 f(\mathbf{x}_0)/\partial x_i^2\) when \(j = i\). If \(\partial^2 f(\mathbf{x})/\partial x_j \partial x_i\) and \(\partial^2 f(\mathbf{x})/\partial x_i^2\) exist and are continuous in an open ball around \(\mathbf{x}_0\) for all \(i, j\), we say that \(f\) is twice continuously differentiable at \(\mathbf{x}_0\).
The matrix of second derivatives is called the Hessian and is denoted by

\[
\mathbf{H}_f(\mathbf{x}_0) =
\begin{pmatrix}
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d \partial x_1} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1 \partial x_d} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d^2}
\end{pmatrix}.
\]
\(\natural\)
Like \(f\) and the gradient \(\nabla f\), the Hessian \(\mathbf{H}_f\) is a function of \(\mathbf{x}\). However, it is a matrix-valued function.
When \(f\) is twice continuously differentiable at \(\mathbf{x}_0\), its Hessian is a symmetric matrix.
THEOREM (Symmetry of the Hessian) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(f\) is twice continuously differentiable at \(\mathbf{x}_0\). Then for all \(i \neq j\)

\[
\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_i \partial x_j} = \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i}.
\]
\(\sharp\)
Proof idea: Two applications of the Mean Value Theorem show that the limits can be interchanged.
Proof: By definition of the partial derivative,
for some \(\theta_j \in (0,1)\). Note that, on the third line, we rearranged the terms and, on the fourth line, we applied the Mean Value Theorem to \(f(\mathbf{x}_0 + h_i \mathbf{e}_i + h_j \mathbf{e}_j) - f(\mathbf{x}_0 + h_j \mathbf{e}_j)\) as a continuously differentiable function of \(h_j\).
Because \(\partial f/\partial x_j\) is continuously differentiable in an open ball around \(\mathbf{x}_0\), a second application of the Mean Value Theorem gives for some \(\theta_i \in (0,1)\)
The claim then follows from the continuity of \(\partial^2 f/\partial x_i \partial x_j\). \(\square\)
EXAMPLE: Consider the quadratic function

\[
f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \,\mathbf{x} + \mathbf{q}^T \mathbf{x} + r.
\]
Recall that the gradient of \(f\) is

\[
\nabla f(\mathbf{x}) = \frac{1}{2} [P + P^T] \,\mathbf{x} + \mathbf{q}.
\]
To simplify the calculation, let \(B = \frac{1}{2}[P + P^T]\) and denote the rows of \(B\) by \(\mathbf{b}_1^T, \ldots,\mathbf{b}_d^T\).
Each component of \(\nabla f\) is an affine function of \(\mathbf{x}\), specifically,

\[
\frac{\partial f(\mathbf{x})}{\partial x_i} = \mathbf{b}_i^T \mathbf{x} + q_i.
\]
Row \(i\) of the Hessian is simply the transpose of the gradient of \(\frac{\partial f (\mathbf{x})}{\partial x_i}\) which, by our previous results, is

\[
\nabla \left( \mathbf{b}_i^T \mathbf{x} + q_i \right)^T = \mathbf{b}_i^T.
\]
Putting this together we get

\[
\mathbf{H}_f(\mathbf{x}) = B = \frac{1}{2} [P + P^T].
\]
Observe that this is indeed a symmetric matrix. \(\lhd\)
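PyTorch can compute Hessians directly via `torch.autograd.functional.hessian`, which gives a quick way to confirm the calculation above. A minimal sketch with made-up \(P\) and \(\mathbf{q}\):

```python
import torch
from torch.autograd.functional import hessian

# The Hessian of f(x) = (1/2) x^T P x + q^T x should be (1/2)(P + P^T).
# P, q and the evaluation point x0 below are made up.
torch.manual_seed(0)
d = 3
P = torch.randn(d, d)
q = torch.randn(d)

def f(x):
    return 0.5 * x @ P @ x + q @ x

x0 = torch.randn(d)
H = hessian(f, x0)
print(torch.allclose(H, 0.5 * (P + P.T)))  # True
```

Note that the result is independent of \(x_0\), as expected for a quadratic, and symmetric, as the theorem requires.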
6.2.3. Jacobian and chain rule#
Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable.
DEFINITION (Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}\) and let \(x_0 \in D\) be an interior point of \(D\). The derivative of \(f\) at \(x_0\) is

\[
f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h},
\]
provided the limit exists. \(\natural\)
A different way to put this is that \(f'(x)\) is the slope of the tangent line to \(f\) at \(x\). Formally, one can approximate \(f(x)\) by a linear function in the neighborhood of \(x\) as follows

\[
f(x + h) = f(x) + f'(x) \, h + r(h),
\]

where \(r(h)\) is negligible compared to \(h\) in the sense that

\[
\lim_{h \to 0} \frac{r(h)}{h} = 0.
\]

Indeed, define

\[
r(h) = f(x + h) - f(x) - f'(x) \, h.
\]

Then by definition of the derivative

\[
\lim_{h \to 0} \frac{r(h)}{h} = \lim_{h \to 0} \left[ \frac{f(x + h) - f(x)}{h} - f'(x) \right] = 0.
\]
For vector-valued functions, we have the following generalization. Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x} \in D\) be an interior point of \(D\). We say that \(\mathbf{f}\) is differentiable at \(\mathbf{x}\) if there exists a matrix \(A \in \mathbb{R}^{m \times d}\) such that

\[
\mathbf{f}(\mathbf{x} + \mathbf{h}) = \mathbf{f}(\mathbf{x}) + A \,\mathbf{h} + \mathbf{r}(\mathbf{h}),
\]

where

\[
\lim_{\mathbf{h} \to \mathbf{0}} \frac{\|\mathbf{r}(\mathbf{h})\|}{\|\mathbf{h}\|} = 0.
\]
The matrix \(\mathbf{f}'(\mathbf{x}) = A\) is called the differential of \(\mathbf{f}\) at \(\mathbf{x}\), and we see that the affine map \(\mathbf{f}(\mathbf{x}) + A \mathbf{h}\) provides an approximation of \(\mathbf{f}\) in the neighborhood of \(\mathbf{x}\).
We will not derive the full theory here. When each component of \(\mathbf{f}\) has continuous partial derivatives in a neighborhood of \(\mathbf{x}\), the differential exists and is equal to the Jacobian, as defined next.
DEFINITION (Jacobian) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\) where \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists for all \(i, j\). The Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\) is the \(m \times d\) matrix

\[
\mathbf{J}_{\mathbf{f}}(\mathbf{x}_0) =
\begin{pmatrix}
\frac{\partial f_1 (\mathbf{x}_0)}{\partial x_1} & \cdots & \frac{\partial f_1 (\mathbf{x}_0)}{\partial x_d} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m (\mathbf{x}_0)}{\partial x_1} & \cdots & \frac{\partial f_m (\mathbf{x}_0)}{\partial x_d}
\end{pmatrix}.
\]
\(\natural\)
THEOREM (Differential and Jacobian) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i, j\). Then the differential at \(\mathbf{x}_0\) is equal to the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\). \(\sharp\)
Recall that for any \(A, B\) for which \(AB\) is well-defined it holds that

\[
\|A B\|_F \leq \|A\|_F \|B\|_F.
\]
This applies in particular when \(B\) is a column vector, in which case \(\|B\|_F\) is its Euclidean norm.
Proof: By the Mean Value Theorem, for each \(i\), there is \(\xi_{\mathbf{h},i} \in (0,1)\) such that
Define
Hence we have
Taking a limit as \(\mathbf{h}\) goes to \(0\), we get
by continuity of the partial derivatives. \(\square\)
EXAMPLE: An example of a vector-valued function is
Its Jacobian is
\(\lhd\)
EXAMPLE: (Gradient and Jacobian) For a continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian reduces to the row vector

\[
\mathbf{J}_{f}(\mathbf{x}_0) = \nabla f(\mathbf{x}_0)^T,
\]
where \(\nabla f(\mathbf{x}_0)\) is the gradient of \(f\) at \(\mathbf{x}_0\). \(\lhd\)
EXAMPLE: (Hessian and Jacobian) For a twice continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian of its gradient is

\[
\mathbf{J}_{\nabla f}(\mathbf{x}_0) = \left[ \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_j \partial x_i} \right]_{i,j=1}^d,
\]
that is, the Hessian of \(f\) at \(\mathbf{x}_0\). \(\lhd\)
EXAMPLE: (Parametric Curve and Jacobian) Consider the parametric curve \(\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d\) for \(t\) in some closed interval of \(\mathbb{R}\). Assume that \(\mathbf{g}(t)\) is continuously differentiable at \(t = t_0\), that is, each of its components is.

Then

\[
\mathbf{J}_{\mathbf{g}}(t_0) = \mathbf{g}'(t_0) = (g_1'(t_0), \ldots, g_d'(t_0))^T.
\]
\(\lhd\)
EXAMPLE: (Affine Map) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1,\ldots,b_m)^T \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_m) : \mathbb{R}^d \to \mathbb{R}^m\) as

\[
\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}.
\]
This is an affine map. Note in particular that, in the case \(\mathbf{b} = \mathbf{0}\) of a linear map,

\[
\mathbf{f}(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha \, \mathbf{f}(\mathbf{x}) + \beta \, \mathbf{f}(\mathbf{y}).
\]
Denote the rows of \(A\) by \(\mathbf{a}_1^T,\ldots,\mathbf{a}_m^T\).
We compute the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}\). Note that

\[
f_i(\mathbf{x}) = \mathbf{a}_i^T \mathbf{x} + b_i,
\quad \text{so} \quad
\frac{\partial f_i (\mathbf{x})}{\partial x_j} = a_{i,j}.
\]
So

\[
\mathbf{J}_{\mathbf{f}}(\mathbf{x}) = A.
\]
\(\lhd\)
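This can be verified with `torch.autograd.functional.jacobian`, a PyTorch routine for computing Jacobians directly. A minimal sketch with made-up \(A\) and \(\mathbf{b}\):

```python
import torch
from torch.autograd.functional import jacobian

# The Jacobian of the affine map f(x) = A x + b is A itself,
# independently of x. A, b and the point below are made up.
torch.manual_seed(0)
A = torch.randn(4, 3)
b = torch.randn(4)

f = lambda x: A @ x + b
J = jacobian(f, torch.randn(3))
print(torch.allclose(J, A))  # True
```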
Functions are often obtained from the composition of simpler ones. We will use the vector notation \(\mathbf{h} = \mathbf{g} \circ \mathbf{f}\) for the function \(\mathbf{h}(\mathbf{x}) = \mathbf{g} (\mathbf{f} (\mathbf{x}))\).
LEMMA (Composition of Continuous Functions) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuous at \(\mathbf{x}_0\) and that \(\mathbf{g}\) is continuous at \(\mathbf{f}(\mathbf{x}_0)\). Then \(\mathbf{g} \circ \mathbf{f}\) is continuous at \(\mathbf{x}_0\). \(\flat\)
The Chain Rule gives a formula for the Jacobian of a composition.
THEOREM (Chain Rule) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Then

\[
\mathbf{J}_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = \mathbf{J}_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \, \mathbf{J}_{\mathbf{f}}(\mathbf{x}_0),
\]
as a product of matrices. \(\sharp\)
Intuitively, the Jacobian provides a linear approximation of the function in the neighborhood of a point. The composition of linear maps corresponds to the product of the associated matrices. Similarly, the Jacobian of a composition is the product of the Jacobians.
Proof: By the Chain Rule for a multivariable function over a parametric curve, we have that

\[
\frac{\partial (g_i \circ \mathbf{f})(\mathbf{x}_0)}{\partial x_j}
= \sum_{k=1}^m \frac{\partial g_i (\mathbf{f}(\mathbf{x}_0))}{\partial f_k} \, \frac{\partial f_k (\mathbf{x}_0)}{\partial x_j},
\]
where the notation \(\frac{\partial g} {\partial f_k}\) indicates the partial derivative of \(g\) with respect to its \(k\)-th component. In matrix form, the claim follows. \(\square\)
EXAMPLE: (Affine Map continued) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). Define again the vector-valued function \(\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). In addition, let \(C \in \mathbb{R}^{p \times m}\) and \(\mathbf{d} \in \mathbb{R}^{p}\), define \(\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^p\) as \(\mathbf{g}(\mathbf{y}) = C \mathbf{y} + \mathbf{d}\).
Then

\[
\mathbf{J}_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = \mathbf{J}_{\mathbf{g}}(\mathbf{f}(\mathbf{x})) \, \mathbf{J}_{\mathbf{f}}(\mathbf{x}) = C A,
\]
for all \(\mathbf{x} \in \mathbb{R}^d\).
This is consistent with the observation that

\[
\mathbf{g}(\mathbf{f}(\mathbf{x})) = C (A \mathbf{x} + \mathbf{b}) + \mathbf{d} = (C A) \,\mathbf{x} + (C \mathbf{b} + \mathbf{d}).
\]
\(\lhd\)
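The product \(C A\) can again be confirmed with `torch.autograd.functional.jacobian`. A minimal sketch with made-up matrices:

```python
import torch
from torch.autograd.functional import jacobian

# Chain Rule check on two made-up affine maps: the Jacobian of
# g(f(x)) = C (A x + b) + d is the matrix product C A.
torch.manual_seed(0)
A, b = torch.randn(4, 3), torch.randn(4)
C, d = torch.randn(2, 4), torch.randn(2)

h = lambda x: C @ (A @ x + b) + d
J = jacobian(h, torch.randn(3))
print(torch.allclose(J, C @ A))  # True
```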
EXAMPLE: Suppose we want to compute the gradient of the function
We could apply the Chain Rule directly, but to illustrate the perspective that is coming up in a subsequent section, we think of \(f\) as a composition of “simpler” vector-valued functions. Specifically, let
Then \(f(x_1, x_2) = h(\mathbf{g}(x_1, x_2)) = h \circ \mathbf{g}(x_1, x_2)\).
By the Chain Rule, we can compute the gradient of \(f\) by first computing the Jacobians of \(\mathbf{g}\) and \(h\). We have already computed the Jacobian of \(\mathbf{g}\)
The Jacobian of \(h\) is
The Chain Rule stipulates that
You can check directly that this is indeed the correct gradient (transposed). \(\lhd\)
6.2.4. Brief introduction to automatic differentiation#
We illustrate the use of automatic differentiation to compute gradients.
Quoting Wikipedia:
In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.
We will use PyTorch. It uses tensors, which in many ways behave similarly to NumPy arrays. See here for a quick introduction. Here is an example. We first initialize the tensors. Here each tensor corresponds to a single real variable. With the option `requires_grad=True`, we indicate that these are variables with respect to which a gradient will be taken later. We initialize the tensors at the values where the derivatives will be computed. If derivatives need to be computed at different values, we need to repeat this process.
import torch

# Initialize variables
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
The function `.backward()` computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with `.grad`. We first define the function. Note that we use `torch.exp`, the PyTorch implementation of the (element-wise) exponential function. Moreover, as in NumPy, PyTorch allows the use of `**` for taking a power. Here is a list of operations on tensors in PyTorch.
# Define function
f = 3 * (x1 ** 2) + x2 + torch.exp(x1 * x2)
# Perform automatic differentiation
f.backward() # Compute gradients
# Print gradients
print(x1.grad) # df/dx1
print(x2.grad) # df/dx2
tensor(20.7781)
tensor(8.3891)
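As a sanity check, the same partial derivatives can be approximated by central finite differences (numerical differentiation, which, as the quote above notes, is distinct from AD). Analytically, \(\partial f/\partial x_1 = 6 x_1 + x_2 e^{x_1 x_2}\) and \(\partial f/\partial x_2 = 1 + x_1 e^{x_1 x_2}\), which at \((1, 2)\) give \(6 + 2e^2\) and \(1 + e^2\):

```python
import math

# Finite-difference check of the gradients of
# f(x1, x2) = 3 x1^2 + x2 + exp(x1 x2) at the point (1, 2)
def f(x1, x2):
    return 3 * x1 ** 2 + x2 + math.exp(x1 * x2)

h = 1e-6
df_dx1 = (f(1 + h, 2) - f(1 - h, 2)) / (2 * h)
df_dx2 = (f(1, 2 + h) - f(1, 2 - h)) / (2 * h)
print(abs(df_dx1 - (6 + 2 * math.e ** 2)) < 1e-4)  # True
print(abs(df_dx2 - (1 + math.e ** 2)) < 1e-4)      # True
```

These agree with the values `tensor(20.7781)` and `tensor(8.3891)` printed above.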
The input parameters can also be vectors, which allows us to consider functions of a large number of variables. Here we use `torch.sum` for taking a sum of the entries.
# New variables for the second example
z = torch.tensor([1., 2., 3.], requires_grad=True)
# Perform automatic differentiation
g = torch.sum(z ** 2)
g.backward() # Compute gradients
# Print gradient
print(z.grad) # gradient is (2 z_1, 2 z_2, 2 z_3)
tensor([2., 4., 6.])
Here is another typical example in a data science context.
# Variables for the third example
X = torch.randn(3, 2) # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]]) # Dataset (labels), a column vector matching the shape of predict
theta = torch.ones(2, 1, requires_grad=True) # Parameter assignment
# Perform automatic differentiation
predict = X @ theta # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2) # Loss function
loss.backward() # Compute gradients
# Print gradient
print(theta.grad) # gradient of loss
tensor([[0.3187],
[2.6270]])
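Because `X` above is random and unseeded, the printed values vary from run to run. Here is a self-contained, seeded variant that also checks the autograd result against the analytic gradient \(2 X^T (X \theta - \mathbf{y})\) of the squared-error loss (data made up for illustration):

```python
import torch

# For loss = ||X theta - y||^2, the gradient with respect to theta
# is 2 X^T (X theta - y). X, y, theta below are made up.
torch.manual_seed(0)
X = torch.randn(3, 2)
y = torch.tensor([[1.], [0.], [1.]])        # labels as a column vector
theta = torch.ones(2, 1, requires_grad=True)

loss = torch.sum((X @ theta - y) ** 2)
loss.backward()

formula = 2 * X.T @ (X @ theta.detach() - y)
print(torch.allclose(theta.grad, formula))  # True
```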
LEARNING BY CHATTING: Ask your favorite AI chatbot to explain how to compute a second derivative using PyTorch (it’s a bit tricky). Ask for code that you can apply to the previous examples.