\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)
8.2. Background: Jacobian, chain rule, and a brief introduction to automatic differentiation#
We introduce the Jacobian of a vector-valued function of several variables, as well as the Chain Rule in this more general setting. We also give a brief introduction to automatic differentiation. Some required matrix algebra is also discussed.
8.2.1. More matrix algebra: Hadamard and Kronecker products#
It will be convenient to introduce more matrix notation.
First, we introduce the Hadamard product\(\idx{Hadamard product}\xdi\) and division\(\idx{Hadamard division}\xdi\). The Hadamard product of two matrices (or vectors) of the same dimension, \(A = (a_{i,j})_{i \in [n], j \in [m]}, B = (b_{i,j})_{i\in [n], j \in [m]} \in \mathbb{R}^{n \times m}\), is defined as the element-wise product

$$
A \odot B = (a_{i,j} \, b_{i,j})_{i \in [n], j \in [m]}.
$$
Similarly, the Hadamard division is defined as the element-wise division

$$
A \oslash B = \left(\frac{a_{i,j}}{b_{i,j}}\right)_{i \in [n], j \in [m]},
$$

where we assume that \(b_{i,j} \neq 0\) for all \(i,j\).
EXAMPLE: (Illustration, continued) Continuing our previous illustrative example
\(\lhd\)
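In NumPy, the Hadamard product and division are simply the element-wise operators * and /. A minimal check, with two arbitrary matrices chosen only for illustration:

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A * B)   # Hadamard product A ⊙ B: element-wise multiplication
print(A / B)   # Hadamard division A ⊘ B: element-wise division (B has no zero entries)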
Recall that \(\mathbf{1}\) is the all-one vector and that, for \(\mathbf{x} = (x_1,\ldots,x_n) \in \mathbb{R}^n\), \(\mathrm{diag}(\mathbf{x}) \in \mathbb{R}^{n \times n}\) is the diagonal matrix with diagonal entries \(x_1,\ldots,x_n\).
LEMMA (Properties of the Hadamard Product) \(\idx{properties of the Hadamard product}\xdi\) Let \(\mathbf{a} = (a_1,\ldots,a_n), \mathbf{b} = (b_1,\ldots,b_n), \mathbf{c} = (c_1,\ldots,c_n) \in \mathbb{R}^n\). Then the following hold:
a) \(\mathrm{diag}(\mathbf{a}) \, \mathbf{b} = \mathbf{a} \odot \mathbf{b}\);
b) \(\mathbf{a}^T(\mathbf{b} \odot \mathbf{c}) = \mathbf{1}^T(\mathbf{a} \odot \mathbf{b} \odot \mathbf{c})\)
and, provided \(a_i \neq 0\) for all \(i\), the following hold as well:
c) \(\mathrm{diag}(\mathbf{a}) \, (\mathbf{b} \oslash \mathbf{a}) = \mathbf{b}\);
d) \(\mathbf{a}^T \, (\mathbf{b} \oslash \mathbf{a}) = \mathbf{1}^T \mathbf{b}\).
\(\flat\)
Proof: a) Multiplying a vector by a diagonal matrix multiplies each of its coordinates by the corresponding diagonal entry, which is precisely the Hadamard product of the vector of diagonal entries with that vector. That proves the first claim.
b) We have

$$
\mathbf{a}^T(\mathbf{b} \odot \mathbf{c}) = \sum_{i=1}^n a_i (b_i c_i) = \sum_{i=1}^n 1 \cdot (a_i b_i c_i) = \mathbf{1}^T(\mathbf{a} \odot \mathbf{b} \odot \mathbf{c}).
$$
c) and d) follow respectively from a) and b).
\(\square\)
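As a quick numerical sanity check of the four identities, with arbitrary vectors having nonzero entries:

import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])
c = np.array([7., 8., 9.])
ones = np.ones(3)

print(np.allclose(np.diag(a) @ b, a * b))              # a)
print(np.allclose(a @ (b * c), ones @ (a * b * c)))    # b)
print(np.allclose(np.diag(a) @ (b / a), b))            # c)
print(np.allclose(a @ (b / a), ones @ b))              # d)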
Second, we introduce the Kronecker product\(\idx{Kronecker product}\xdi\). Let \(A = (a_{i,j})_{i \in [n], j \in [m]} \in \mathbb{R}^{n \times m}\) and \(B = (b_{i,j})_{i \in [p], j \in [q]} \in \mathbb{R}^{p \times q}\) be arbitrary matrices. Their Kronecker product, denoted \(A \otimes B \in \mathbb{R}^{np \times mq}\), is the following matrix in block form

$$
A \otimes B =
\begin{pmatrix}
a_{1,1} B & \cdots & a_{1,m} B\\
\vdots & \ddots & \vdots\\
a_{n,1} B & \cdots & a_{n,m} B
\end{pmatrix}.
$$
EXAMPLE: Here is a simple illustrative example from Wikipedia:
\(\lhd\)
EXAMPLE: (Outer product) \(\idx{outer product}\xdi\) Here is another example we have encountered previously, the outer product of two vectors \(\mathbf{u} = (u_1,\ldots,u_n) \in \mathbb{R}^n\) and \(\mathbf{v} = (v_1,\ldots, v_m) \in \mathbb{R}^m\). Recall that the outer product is defined in block form as the \(n \times m\) matrix

$$
\mathbf{u} \mathbf{v}^T =
\begin{pmatrix}
v_1 \mathbf{u} & v_2 \mathbf{u} & \cdots & v_m \mathbf{u}
\end{pmatrix}.
$$

Equivalently,

$$
\mathbf{u} \mathbf{v}^T = \mathbf{u} \otimes \mathbf{v}^T = \mathbf{v}^T \otimes \mathbf{u}.
$$
\(\lhd\)
EXAMPLE: (continued) In the previous example the Kronecker product turned out to be commutative (i.e., we had \(\mathbf{v}^T \otimes \mathbf{u} = \mathbf{u} \otimes \mathbf{v}^T\)). This is not the case in general. Going back to the first example above, note that
You can check that this is different from what we obtained in the opposite order. \(\lhd\)
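NumPy provides the Kronecker product as np.kron. A small sketch, with arbitrary matrices, illustrating that the product is not commutative in general:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])

print(np.kron(A, B))   # A ⊗ B
print(np.kron(B, A))   # B ⊗ A: same dimensions here, but different entries
print(np.array_equal(np.kron(A, B), np.kron(B, A)))   # False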
The proof of the following lemma is left as an exercise.
LEMMA (Properties of the Kronecker Product) \(\idx{properties of the Kronecker Product}\xdi\) The Kronecker product has the following properties:
a) If \(B, C\) are matrices of the same dimension
b) If \(A, B, C, D\) are matrices of such size that one can form the matrix products \(AC\) and \(BD\), then
c) If \(A, C\) are matrices of the same dimension and \(B, D\) are matrices of the same dimension, then
d) If \(A,B\) are invertible, then
e) The transpose of \(A \otimes B\) is
f) If \(\mathbf{u}\) is a column vector and \(A, B\) are matrices of such size that one can form the matrix product \(AB\), then
Similarly,
\(\flat\)
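Two of these properties are easy to check numerically on randomly generated matrices of compatible sizes: the mixed-product rule (the content of b)) and the transpose rule (e)). A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 2))
D = rng.standard_normal((2, 4))

# b) mixed-product property: (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

# e) transpose: (A ⊗ B)^T = A^T ⊗ B^T
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))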
8.2.2. Jacobian#
Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable. A different way to put this is that \(f'(x)\) is the slope of the tangent line to \(f\) at \(x\). Formally, one can approximate \(f(x)\) by a linear function in the neighborhood of \(x\) as follows

$$
f(x + h) = f(x) + f'(x)\, h + r(h),
$$

where \(r(h)\) is negligible compared to \(h\) in the sense that

$$
\lim_{h \to 0} \frac{r(h)}{h} = 0.
$$

Indeed, define

$$
r(h) = f(x+h) - f(x) - f'(x)\, h.
$$

Then by definition of the derivative

$$
\lim_{h \to 0} \frac{r(h)}{h} = \lim_{h \to 0} \left(\frac{f(x+h) - f(x)}{h} - f'(x)\right) = 0.
$$
For vector-valued functions, we have the following generalization. Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x} \in D\) be an interior point of \(D\). We say that \(\mathbf{f}\) is differentiable\(\idx{differentiable}\xdi\) at \(\mathbf{x}\) if there exists a matrix \(A \in \mathbb{R}^{m \times d}\) such that

$$
\mathbf{f}(\mathbf{x} + \mathbf{h}) = \mathbf{f}(\mathbf{x}) + A \mathbf{h} + \mathbf{r}(\mathbf{h}),
$$

where

$$
\lim_{\mathbf{h} \to \mathbf{0}} \frac{\|\mathbf{r}(\mathbf{h})\|}{\|\mathbf{h}\|} = 0.
$$
The matrix \(\mathbf{f}'(\mathbf{x}) = A\) is called the differential\(\idx{differential}\xdi\) of \(\mathbf{f}\) at \(\mathbf{x}\), and we see that the affine map \(\mathbf{f}(\mathbf{x}) + A \mathbf{h}\) provides an approximation of \(\mathbf{f}\) in the neighborhood of \(\mathbf{x}\).
We will not derive the full theory here. When each component of \(\mathbf{f}\) has continuous partial derivatives in a neighborhood of \(\mathbf{x}\), the differential exists and is equal to the Jacobian, as defined next.
DEFINITION (Jacobian) \(\idx{Jacobian}\xdi\) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\) where \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists for all \(i, j\). The Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\) is the \(m \times d\) matrix

$$
J_{\mathbf{f}}(\mathbf{x}_0) =
\begin{pmatrix}
\frac{\partial f_1 (\mathbf{x}_0)}{\partial x_1} & \cdots & \frac{\partial f_1 (\mathbf{x}_0)}{\partial x_d}\\
\vdots & \ddots & \vdots\\
\frac{\partial f_m (\mathbf{x}_0)}{\partial x_1} & \cdots & \frac{\partial f_m (\mathbf{x}_0)}{\partial x_d}
\end{pmatrix}.
$$
\(\natural\)
THEOREM (Differential and Jacobian) \(\idx{differential and Jacobian theorem}\xdi\) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i, j\). Then the differential at \(\mathbf{x}_0\) is equal to the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\). \(\sharp\)
Recall that for any \(A, B\) for which \(AB\) is well-defined it holds that \(\|A B \|_F \leq \|A\|_F \|B\|_F\). This applies in particular when \(B\) is a column vector, in which case \(\|B\|_F\) is its Euclidean norm.
Proof: By the Mean Value Theorem, for each \(i\), there is \(\xi_{\mathbf{h},i} \in (0,1)\) such that
Define
Hence we have
Taking a limit as \(\mathbf{h}\) goes to \(0\), we get
by continuity of the partial derivatives. \(\square\)
EXAMPLE: An example of a vector-valued function is
Its Jacobian is
\(\lhd\)
EXAMPLE: (Gradient and Jacobian) For a continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian reduces to the row vector

$$
J_{f}(\mathbf{x}_0) = \nabla f(\mathbf{x}_0)^T,
$$
where \(\nabla f(\mathbf{x}_0)\) is the gradient of \(f\) at \(\mathbf{x}_0\). \(\lhd\)
EXAMPLE: (Hessian and Jacobian) For a twice continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian of its gradient is

$$
J_{\nabla f}(\mathbf{x}_0) = \left(\frac{\partial^2 f(\mathbf{x}_0)}{\partial x_i \partial x_j}\right)_{i,j=1}^d,
$$
that is, the Hessian of \(f\) at \(\mathbf{x}_0\). \(\lhd\)
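We can confirm numerically that the Jacobian of the gradient is the Hessian using PyTorch's torch.autograd.functional module, here for an arbitrarily chosen twice-differentiable function:

import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    return x[0]**2 * x[1] + torch.sin(x[1])

x0 = torch.tensor([1.0, 2.0])

# gradient of f as a function of x; create_graph=True keeps it differentiable
grad_f = lambda x: jacobian(f, x, create_graph=True)

print(jacobian(grad_f, x0))   # Jacobian of the gradient ...
print(hessian(f, x0))         # ... coincides with the Hessian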
EXAMPLE: (Parametric Curve and Jacobian) Consider the parametric curve \(\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d\) for \(t\) in some closed interval of \(\mathbb{R}\). Assume that \(\mathbf{g}(t)\) is continuously differentiable at \(t\), that is, each of its components is.
Then

$$
J_{\mathbf{g}}(t) =
\begin{pmatrix}
g_1'(t)\\
\vdots\\
g_d'(t)
\end{pmatrix},
$$

a \(d \times 1\) matrix, that is, a single column.
\(\lhd\)
EXAMPLE: (Affine Map) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1,\ldots,b_m)^T \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_m) : \mathbb{R}^d \to \mathbb{R}^m\) as

$$
\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}.
$$
This is an affine map. Note in particular that, in the case \(\mathbf{b} = \mathbf{0}\) of a linear map, \(\mathbf{f}(\mathbf{x}) = A \mathbf{x}\).
Denote the rows of \(A\) by \(\mathbf{a}_1^T,\ldots,\mathbf{a}_m^T\).
We compute the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}\). Note that

$$
f_i(\mathbf{x}) = \mathbf{a}_i^T \mathbf{x} + b_i,
\qquad \text{so} \qquad
\nabla f_i(\mathbf{x}) = \mathbf{a}_i.
$$

So

$$
J_{\mathbf{f}}(\mathbf{x}) =
\begin{pmatrix}
\nabla f_1(\mathbf{x})^T\\
\vdots\\
\nabla f_m(\mathbf{x})^T
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{a}_1^T\\
\vdots\\
\mathbf{a}_m^T
\end{pmatrix}
= A.
$$
\(\lhd\)
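This can be checked with automatic differentiation (introduced later in this section): PyTorch's torch.autograd.functional.jacobian recovers the matrix \(A\) regardless of where the Jacobian is evaluated. A minimal sketch with randomly generated data:

import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
A = torch.randn(3, 2)
b = torch.randn(3)

f = lambda x: A @ x + b       # affine map
x0 = torch.randn(2)           # arbitrary evaluation point

print(torch.allclose(jacobian(f, x0), A))   # True: the Jacobian equals A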
The following important example is a less straightforward application of the Jacobian.
It will be useful to introduce the vectorization \(\mathrm{vec}(A) \in \mathbb{R}^{nm}\) of a matrix \(A = (a_{i,j})_{i,j} \in \mathbb{R}^{n \times m}\) as the vector

$$
\mathrm{vec}(A) = (a_{1,1}, \ldots, a_{n,1}, a_{1,2}, \ldots, a_{n,2}, \ldots, a_{1,m}, \ldots, a_{n,m})^T.
$$
That is, it is obtained by stacking the columns of \(A\) on top of each other.
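In NumPy, stacking the columns corresponds to column-major (Fortran-order) flattening:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.flatten(order='F'))   # vec(A) = [1 4 2 5 3 6]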
EXAMPLE: (Jacobian of a Linear Map with Respect to its Matrix) We take a different tack on the previous example. In data science applications, it will be useful to compute the Jacobian of a linear map \(X \mathbf{z}\) with respect to the matrix \(X \in \mathbb{R}^{n \times m}\). Specifically, for a fixed \(\mathbf{z} \in \mathbb{R}^{m}\), letting \((\mathbf{x}^{(i)})^T\) be the \(i\)-th row of \(X\), we define the function

$$
\mathbf{f}(\mathbf{x}) = X \mathbf{z} =
\begin{pmatrix}
(\mathbf{x}^{(1)})^T \mathbf{z}\\
\vdots\\
(\mathbf{x}^{(n)})^T \mathbf{z}
\end{pmatrix},
$$
where \(\mathbf{x} = \mathrm{vec}(X^T) = (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})\).
To compute the Jacobian, let us look at its columns that correspond to the variables in \(\mathbf{x}^{(k)}\), that is, columns \(\alpha_k = (k-1) m + 1\) to \(\beta_k = k m\). Note that only the \(k\)-th component of \(\mathbf{f}\) depends on \(\mathbf{x}^{(k)}\), so the rows \(\neq k\) of \(J_{\mathbf{f}}(\mathbf{x})\) are \(0\) for the corresponding columns.
Row \(k\) on the other hand is \(\mathbf{z}^T\) from our previous formula for the gradient of an affine map. Hence one way to write the columns \(\alpha_k\) to \(\beta_k\) of \(J_{\mathbf{f}}(\mathbf{x})\) is \(\mathbf{e}_k \mathbf{z}^T\), where here \(\mathbf{e}_k \in \mathbb{R}^{n}\) is the \(k\)-th standard basis vector of \(\mathbb{R}^{n}\).
So \(J_{\mathbf{f}}(\mathbf{x})\) can be written in block form as

$$
J_{\mathbf{f}}(\mathbf{x}) =
\begin{pmatrix}
\mathbf{e}_1 \mathbf{z}^T & \mathbf{e}_2 \mathbf{z}^T & \cdots & \mathbf{e}_n \mathbf{z}^T
\end{pmatrix}
= I_{n \times n} \otimes \mathbf{z}^T
=: \mathbb{B}_{n}[\mathbf{z}],
$$
where the last equality is a definition. \(\lhd\)
We will need one more wrinkle.
EXAMPLE: (Jacobian of a Linear Map with Respect to its Input and Matrix) Consider again the linear map \(X \mathbf{z}\), this time as a function of both the matrix \(X \in \mathbb{R}^{n \times m}\) and the vector \(\mathbf{z} \in \mathbb{R}^{m}\). That is, letting again \((\mathbf{x}^{(i)})^T\) be the \(i\)-th row of \(X\), we define the function

$$
\mathbf{f}(\mathbf{z}, \mathbf{x}) = X \mathbf{z},
$$
where as before \(\mathbf{x} = \mathrm{vec}(X^T) = (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})\).
To compute the Jacobian, we think of it as a block matrix and use the two previous examples. The columns of \(J_{\mathbf{f}}(\mathbf{z}, \mathbf{x})\) corresponding to the variables in \(\mathbf{z}\), that is, columns \(1\) to \(m\), are the matrix

$$
X =: \mathbb{A}_{n}[\mathbf{x}],
$$

since, for fixed \(\mathbf{x}\), the map \(\mathbf{z} \mapsto X \mathbf{z}\) is linear and its Jacobian was computed in the affine map example.
The columns of \(J_{\mathbf{f}}(\mathbf{z}, \mathbf{x})\) corresponding to the variables in \(\mathbf{x}\), that is, columns \(m + 1\) to \(m + nm\), are the matrix \(\mathbb{B}_{n}[\mathbf{z}]\). Note that, in both \(\mathbb{A}_{n}[\mathbf{x}]\) and \(\mathbb{B}_{n}[\mathbf{z}]\), the subscript \(n\) indicates the number of rows of the matrix. The number of columns is determined by \(n\) and the size of the input vector:
the length of \(\mathbf{x}\) divided by \(n\) for \(\mathbb{A}_{n}[\mathbf{x}]\);
the length of \(\mathbf{z}\) multiplied by \(n\) for \(\mathbb{B}_{n}[\mathbf{z}]\).
So \(J_{\mathbf{f}}(\mathbf{z}, \mathbf{x})\) can be written in block form as

$$
J_{\mathbf{f}}(\mathbf{z}, \mathbf{x}) =
\begin{pmatrix}
\mathbb{A}_{n}[\mathbf{x}] & \mathbb{B}_{n}[\mathbf{z}]
\end{pmatrix}.
$$
\(\lhd\)
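The block structure can be verified with automatic differentiation. The sketch below, with arbitrary dimensions and random data, checks that the Jacobian with respect to \(\mathbf{z}\) is \(X\) and that the Jacobian with respect to \(\mathbf{x} = \mathrm{vec}(X^T)\) consists of the blocks \(\mathbf{e}_k \mathbf{z}^T\), that is, equals \(I_{n \times n} \otimes \mathbf{z}^T\):

import torch
from torch.autograd.functional import jacobian

n, m = 3, 2
torch.manual_seed(0)
X = torch.randn(n, m)
z = torch.randn(m)
x = X.reshape(-1)                      # vec(X^T): the rows of X stacked into one vector

f = lambda z, x: x.reshape(n, m) @ z   # f(z, x) = X z written in terms of x

Jz, Jx = jacobian(f, (z, x))
print(torch.allclose(Jz, X))                                             # A_n[x] block
print(torch.allclose(Jx, torch.kron(torch.eye(n), z.reshape(1, -1))))    # B_n[z] block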
EXAMPLE: (Elementwise Function) Let \(f : D \to \mathbb{R}\), with \(D \subseteq \mathbb{R}\), be a continuously differentiable real-valued function of a single variable. For \(n \geq 2\), consider applying \(f\) to each entry of a vector \(\mathbf{x} \in \mathbb{R}^n\), that is, let \(\mathbf{f} : D^n \to \mathbb{R}^n\) with

$$
\mathbf{f}(\mathbf{x}) = (f(x_1), \ldots, f(x_n))^T.
$$
The Jacobian of \(\mathbf{f}\) can be computed from \(f'\), the derivative of the single-variable case. Indeed, letting \(\mathbf{x} = (x_1,\ldots,x_n)\) be such that \(x_i\) is an interior point of \(D\) for all \(i\),

$$
\frac{\partial f_j(\mathbf{x})}{\partial x_j} = f'(x_j),
$$

while for \(\ell \neq j\)

$$
\frac{\partial f_\ell(\mathbf{x})}{\partial x_j} = 0,
$$
as \(f_\ell(\mathbf{x})\) does not in fact depend on \(x_j\). In other words, the \(j\)-th column of the Jacobian is \(f'(x_j) \,\mathbf{e}_j\), where again \(\mathbf{e}_{j}\) is the \(j\)-th standard basis vector in \(\mathbb{R}^{n}\).
So \(J_{\mathbf{f}}(\mathbf{x})\) is the diagonal matrix with diagonal entries \(f'(x_j)\), \(j=1, \ldots, n\), which we denote by

$$
J_{\mathbf{f}}(\mathbf{x}) = \mathrm{diag}(f'(x_1), \ldots, f'(x_n)).
$$
\(\lhd\)
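For instance, applying \(\sin\) element-wise gives a Jacobian that is the diagonal matrix of cosines; the following sketch confirms this with automatic differentiation:

import torch
from torch.autograd.functional import jacobian

x0 = torch.tensor([0.5, 1.0, 2.0])
f = lambda x: torch.sin(x)             # element-wise application of sin

print(jacobian(f, x0))                 # diagonal matrix ...
print(torch.diag(torch.cos(x0)))       # ... with entries sin'(x_j) = cos(x_j)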
8.2.3. Generalization of the Chain Rule#
As we have seen, functions are often obtained from the composition of simpler ones. We will use the vector notation \(\mathbf{h} = \mathbf{g} \circ \mathbf{f}\) for the function \(\mathbf{h}(\mathbf{x}) = \mathbf{g} (\mathbf{f} (\mathbf{x}))\).
LEMMA (Composition of Continuous Functions) \(\idx{composition of continuous functions lemma}\xdi\) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuous at \(\mathbf{x}_0\) and that \(\mathbf{g}\) is continuous at \(\mathbf{f}(\mathbf{x}_0)\). Then \(\mathbf{g} \circ \mathbf{f}\) is continuous at \(\mathbf{x}_0\). \(\flat\)
The Chain Rule gives a formula for the Jacobian of a composition.
THEOREM (Chain Rule) \(\idx{chain rule}\xdi\) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Then

$$
J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \, J_{\mathbf{f}}(\mathbf{x}_0)
$$
as a product of matrices. \(\sharp\)
Intuitively, the Jacobian provides a linear approximation of the function in the neighborhood of a point. The composition of linear maps corresponds to the product of the associated matrices. Similarly, the Jacobian of a composition is the product of the Jacobians.
Proof: To avoid confusion, we think of \(\mathbf{f}\) and \(\mathbf{g}\) as being functions of variables with different names, specifically,
and
We apply the Chain Rule for a real-valued function over a parametric vector curve. That is, we think of

$$
g_i(\mathbf{f}(\mathbf{x})) = g_i(f_1(\mathbf{x}), \ldots, f_m(\mathbf{x}))
$$

as a function of \(x_j\) only, with all other \(x_i\)'s fixed.
We get that

$$
\frac{\partial}{\partial x_j}\, g_i(\mathbf{f}(\mathbf{x}_0))
= \sum_{k=1}^m \frac{\partial g_i}{\partial y_k}(\mathbf{f}(\mathbf{x}_0)) \, \frac{\partial f_k(\mathbf{x}_0)}{\partial x_j},
$$
where, as before, the notation \(\frac{\partial g_i} {\partial y_k}\) indicates the partial derivative of \(g_i\) with respect to its \(k\)-th component. In matrix form, the claim follows. \(\square\)
EXAMPLE: (Affine Map continued) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). Define again the vector-valued function \(\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). In addition, let \(C \in \mathbb{R}^{p \times m}\) and \(\mathbf{d} \in \mathbb{R}^{p}\), define \(\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^p\) as \(\mathbf{g}(\mathbf{y}) = C \mathbf{y} + \mathbf{d}\).
Then

$$
J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x})) \, J_{\mathbf{f}}(\mathbf{x}) = C A
$$
for all \(\mathbf{x} \in \mathbb{R}^d\).
This is consistent with the observation that

$$
\mathbf{g}(\mathbf{f}(\mathbf{x})) = C (A \mathbf{x} + \mathbf{b}) + \mathbf{d} = (CA) \mathbf{x} + (C \mathbf{b} + \mathbf{d}).
$$
\(\lhd\)
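Here is a minimal numerical check of this composition rule with randomly generated matrices (the offset of \(\mathbf{g}\) is named e in the code to avoid clashing with the dimension d):

import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
d, m, p = 2, 3, 4
A, b = torch.randn(m, d), torch.randn(m)
C, e = torch.randn(p, m), torch.randn(p)

f = lambda x: A @ x + b
g = lambda y: C @ y + e
h = lambda x: g(f(x))                  # the composition g ∘ f

x0 = torch.randn(d)
print(torch.allclose(jacobian(h, x0), C @ A))   # True: J_{g∘f}(x) = C A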
EXAMPLE: Suppose we want to compute the gradient of the function
We could apply the Chain Rule directly, but to illustrate the perspective that is coming up, we think of \(f\) as a composition of “simpler” vector-valued functions. Specifically, let
Then \(f(x_1, x_2) = h(\mathbf{g}(x_1, x_2)) = h \circ \mathbf{g}(x_1, x_2)\).
By the Chain Rule, we can compute the gradient of \(f\) by first computing the Jacobians of \(\mathbf{g}\) and \(h\). We have already computed the Jacobian of \(\mathbf{g}\)
The Jacobian of \(h\) is
The Chain Rule stipulates that
You can check directly (i.e., without the composition) that this is indeed the correct gradient (transposed).
Alternatively, it is instructive to “expand” the Chain Rule as we did in its proof. Specifically,
Note that this corresponds to multiplying \(J_h(\mathbf{g}(x_1,x_2))\) by the first column of \(J_{\mathbf{g}}(x_1, x_2)\).
Similarly
This corresponds to multiplying \(J_h(\mathbf{g}(x_1,x_2))\) by the second column of \(J_{\mathbf{g}}(x_1, x_2)\). \(\lhd\)
CHAT & LEARN The Jacobian determinant has important applications in change of variables for multivariable integrals. Ask your favorite AI chatbot to explain this application and provide an example of using the Jacobian determinant in a change of variables for a double integral. \(\ddagger\)
8.2.4. Brief introduction to automatic differentiation in PyTorch#
We illustrate the use of automatic differentiation\(\idx{automatic differentiation}\xdi\) to compute gradients in PyTorch.
Quoting Wikipedia:
In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.
Automatic differentiation in PyTorch We will use PyTorch. It uses tensors\(\idx{tensor}\xdi\), which in many ways behave similarly to Numpy arrays. See here for a quick introduction. We first initialize the tensors. Here each tensor corresponds to a single real variable. With the option requires_grad=True
, we indicate that these are variables with respect to which a gradient will be taken later. We initialize the tensors at the values where the derivatives will be computed. If derivatives need to be computed at different values, we need to repeat this process. The function .backward()
computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with .grad
.
NUMERICAL CORNER: This is better understood through an example.
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
We define the function. Note that we use
torch.exp
, the PyTorch implementation of the (element-wise) exponential function. Moreover, as in NumPy, PyTorch allows the use of **
for taking a power. Here is a list of operations on tensors in PyTorch.
f = 3 * (x1 ** 2) + x2 + torch.exp(x1 * x2)
f.backward()
print(x1.grad) # df/dx1
print(x2.grad) # df/dx2
tensor(20.7781)
tensor(8.3891)
The input parameters can also be vectors, which allows us to consider functions of a large number of variables. Here we use torch.sum
to sum the entries of its argument.
z = torch.tensor([1., 2., 3.], requires_grad=True)
g = torch.sum(z ** 2)
g.backward()
print(z.grad) # gradient is (2 z_1, 2 z_2, 2 z_3)
tensor([2., 4., 6.])
Here is another typical example in a data science context.
X = torch.randn(3, 2) # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]]) # Dataset (labels), as a column vector to match the predictions
theta = torch.ones(2, 1, requires_grad=True) # Parameter assignment
predict = X @ theta # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2) # Loss function
loss.backward() # Compute gradients
print(theta.grad) # gradient of loss
tensor([[ 3.6615],
[27.0129]])
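Continuing the same cell, and assuming as above that y is a column vector aligned with the predictions, the gradient of this squared-error loss has the closed form \(2 X^T (X \boldsymbol{\theta} - \mathbf{y})\); this provides a quick sanity check of the autograd result:

with torch.no_grad():
    manual_grad = 2 * X.T @ (X @ theta - y)   # analytic gradient of the loss
print(manual_grad)   # should match theta.grad above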
CHAT & LEARN Ask your favorite AI chatbot to explain how to compute a second derivative using PyTorch (it’s a bit tricky). Ask for code that you can apply to the previous examples. (Open In Colab) \(\ddagger\)
\(\unlhd\)
Implementing gradient descent in PyTorch Rather than explicitly specifying the gradient function, we could use PyTorch to compute it automatically. This is done next. Note that the descent update is done within with torch.no_grad()
, which ensures that the update operation itself is not tracked for gradient computation. Here the input x0
as well as the output xk.numpy(force=True)
are Numpy arrays. The function torch.Tensor.numpy()
converts a PyTorch tensor to a Numpy array (see the documentation for an explanation of the force=True
option). Also, quoting ChatGPT:
In the given code,
.item()
is used to extract the scalar value from a tensor. In PyTorch, when you perform operations on tensors, you get back tensors as results, even if the result is a single scalar value..item()
is used to extract this scalar value from the tensor.
def gd_with_ad(f, x0, alpha=1e-3, niters=int(1e6)):
    xk = torch.tensor(x0, requires_grad=True, dtype=torch.float)
    for _ in range(niters):
        value = f(xk)
        value.backward()           # compute the gradient of f at the current iterate
        with torch.no_grad():      # the descent update itself is not tracked
            xk -= alpha * xk.grad
            xk.grad.zero_()        # reset the gradient before the next iteration
    return xk.numpy(force=True), f(xk).item()
NUMERICAL CORNER: We revisit a previous example.
def f(x):
    return x**3
print(gd_with_ad(f, 2, niters=int(1e4)))
(array(0.03277362, dtype=float32), 3.5202472645323724e-05)
print(gd_with_ad(f, -2, niters=100))
(array(-4.9335055, dtype=float32), -120.07894897460938)
\(\unlhd\)
CHAT & LEARN The section briefly mentions that automatic differentiation is distinct from symbolic differentiation and numerical differentiation. Ask your favorite AI chatbot to explain in more detail the differences between these three methods of computing derivatives. \(\ddagger\)
Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)
1 Let \(A \in \mathbb{R}^{n \times m}\) and \(B \in \mathbb{R}^{p \times q}\). What are the dimensions of the Kronecker product \(A \otimes B\)?
a) \(n \times m\)
b) \(p \times q\)
c) \(np \times mq\)
d) \(nq \times mp\)
2 If \(f: \mathbb{R}^d \rightarrow \mathbb{R}^m\) is a continuously differentiable function, what is its Jacobian \(J_f(x_0)\) at an interior point \(x_0\) of its domain?
a) A scalar representing the rate of change of \(f\) at \(x_0\).
b) A vector in \(\mathbb{R}^m\) representing the direction of steepest ascent of \(f\) at \(x_0\).
c) An \(m \times d\) matrix of partial derivatives of the component functions of \(f\) at \(x_0\).
d) The Hessian matrix of \(f\) at \(x_0\).
3 In the context of the Chain Rule, if \(f: \mathbb{R}^2 \to \mathbb{R}^3\) and \(g: \mathbb{R}^3 \to \mathbb{R}\), what is the dimension of the Jacobian matrix \(J_{g \circ f}(x)\)?
a) \(3 \times 2\)
b) \(1 \times 3\)
c) \(2 \times 3\)
d) \(1 \times 2\)
4 Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Which of the following is correct according to the Chain Rule?
a) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{f}}(\mathbf{x}_0) \, J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0))\)
b) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \, J_{\mathbf{f}}(\mathbf{x}_0)\)
c) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{x}_0)) \, J_{\mathbf{g}}(\mathbf{x}_0)\)
d) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{x}_0) \, J_{\mathbf{f}}(\mathbf{g}(\mathbf{x}_0))\)
5 Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). What is the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\)?
a) \(J_{\mathbf{f}}(\mathbf{x}_0) = A^T\)
b) \(J_{\mathbf{f}}(\mathbf{x}_0) = A \mathbf{x}_0 + \mathbf{b}\)
c) \(J_{\mathbf{f}}(\mathbf{x}_0) = A\)
d) \(J_{\mathbf{f}}(\mathbf{x}_0) = \mathbf{b}\)
Answer for 1: c. Justification: The text defines the Kronecker product as a matrix in block form with dimensions \(np \times mq\).
Answer for 2: c. Justification: The text defines the Jacobian of a vector-valued function as a matrix of partial derivatives.
Answer for 3: d. Justification: The composition \(g \circ f\) maps \(\mathbb{R}^2 \to \mathbb{R}\), hence the Jacobian matrix \(J_{g \circ f}(x)\) is \(1 \times 2\).
Answer for 4: b. Justification: The text states: “The Chain Rule gives a formula for the Jacobian of a composition. […] Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Then \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \, J_{\mathbf{f}}(\mathbf{x}_0)\) as a product of matrices.”
Answer for 5: c. Justification: The text states: “Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1,\ldots,b_m)^T \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_m) : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). […] So \(J_{\mathbf{f}}(\mathbf{x}) = A.\)”