\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

8.2. Background: Jacobian, chain rule, and a brief introduction to automatic differentiation#

We introduce the Jacobian of vector-valued functions of several variables as well as the Chain Rule for this more general setting. We also give a brief introduction to automatic differentiation. Some required matrix algebra is also discussed.

8.2.1. More matrix algebra: Hadamard and Kronecker products#

It will be convenient to introduce more matrix notation.

First, we introduce the Hadamard product\(\idx{Hadamard product}\xdi\) and division\(\idx{Hadamard division}\xdi\). The Hadamard product of two matrices (or vectors) of the same dimension, \(A = (a_{i,j})_{i \in [n], j \in [m]}, B = (b_{i,j})_{i\in [n], j \in [m]} \in \mathbb{R}^{n \times m}\), is defined as the element-wise product

\[ A \odot B = (a_{i,j} b_{i,j})_{i\in [n], j \in [m]}. \]

Similarly the Hadamard division is defined as the element-wise division

\[ A \oslash B = (a_{i,j} / b_{i,j})_{i\in [n], j \in [m]} \]

where we assume that \(b_{i,j} \neq 0\) for all \(i,j\).

EXAMPLE: (Illustration, continued) Continuing our previous illustrative example

\[\begin{split} \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \odot \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} = \begin{bmatrix} 1 \times 0 & 2 \times 5\\ 3 \times 6 & 4 \times 7\\ \end{bmatrix} = \begin{bmatrix} 0 & 10\\ 18 & 28\\ \end{bmatrix}. \end{split}\]

\(\lhd\)
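For readers following along in code, the Hadamard product and division correspond to NumPy's element-wise operators * and / (a minimal sketch using the matrices from the example above; note that we divide by \(A\), whose entries are all nonzero):

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 5.], [6., 7.]])

print(A * B)   # Hadamard product: [[ 0. 10.], [18. 28.]]
print(B / A)   # Hadamard division B ⊘ A (defined since all entries of A are nonzero)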

Recall that \(\mathbf{1}\) is the all-one vector and that, for \(\mathbf{x} = (x_1,\ldots,x_n) \in \mathbb{R}^n\), \(\mathrm{diag}(\mathbf{x}) \in \mathbb{R}^{n \times n}\) is the diagonal matrix with diagonal entries \(x_1,\ldots,x_n\).

LEMMA (Properties of the Hadamard Product) \(\idx{properties of the Hadamard product}\xdi\) Let \(\mathbf{a} = (a_1,\ldots,a_n), \mathbf{b} = (b_1,\ldots,b_n), \mathbf{c} = (c_1,\ldots,c_n) \in \mathbb{R}^n\). Then the following hold:

a) \(\mathrm{diag}(\mathbf{a}) \, \mathbf{b} = \mathbf{a} \odot \mathbf{b}\);

b) \(\mathbf{a}^T(\mathbf{b} \odot \mathbf{c}) = \mathbf{1}^T(\mathbf{a} \odot \mathbf{b} \odot \mathbf{c})\)

and, provided \(a_i \neq 0\) for all \(i\), the following hold as well:

c) \(\mathrm{diag}(\mathbf{a}) \, (\mathbf{b} \oslash \mathbf{a}) = \mathbf{b}\);

d) \(\mathbf{a}^T \, (\mathbf{b} \oslash \mathbf{a}) = \mathbf{1}^T \mathbf{b}\).

\(\flat\)

Proof: a) Multiplying a vector by a diagonal matrix multiplies each entry of the vector by the corresponding diagonal entry. Hence \(\mathrm{diag}(\mathbf{a}) \, \mathbf{b} = (a_i b_i)_{i \in [n]} = \mathbf{a} \odot \mathbf{b}\), which proves the first claim.

b) We have

\[ \mathbf{a}^T \, (\mathbf{b} \odot \mathbf{c}) = \sum_{i=1}^n a_i (b_i c_i) = \mathbf{1}^T (\mathbf{a} \odot \mathbf{b} \odot \mathbf{c}). \]

c) and d) follow from a) and b), respectively, applied with \(\mathbf{b}\) replaced by \(\mathbf{b} \oslash \mathbf{a}\) (and \(\mathbf{c} = \mathbf{1}\) in b)), using that \(\mathbf{a} \odot (\mathbf{b} \oslash \mathbf{a}) = \mathbf{b}\).

\(\square\)
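As a quick sanity check, here is a numerical spot-check of the four identities with random vectors whose entries are nonzero (a sketch; the comment labels match the lemma):

import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.uniform(1, 2, 3), rng.uniform(1, 2, 3), rng.uniform(1, 2, 3)

print(np.allclose(np.diag(a) @ b, a * b))            # a) diag(a) b = a ⊙ b
print(np.allclose(a @ (b * c), np.sum(a * b * c)))   # b) a^T (b ⊙ c) = 1^T (a ⊙ b ⊙ c)
print(np.allclose(np.diag(a) @ (b / a), b))          # c) diag(a) (b ⊘ a) = b
print(np.allclose(a @ (b / a), np.sum(b)))           # d) a^T (b ⊘ a) = 1^T b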

Second, we introduce the Kronecker product\(\idx{Kronecker product}\xdi\). Let \(A = (a_{i,j})_{i \in [n], j \in [m]} \in \mathbb{R}^{n \times m}\) and \(B = (b_{i,j})_{i \in [p], j \in [q]} \in \mathbb{R}^{p \times q}\) be arbitrary matrices. Their Kronecker product, denoted \(A \otimes B \in \mathbb{R}^{np \times mq}\), is the following matrix in block form

\[\begin{split} A \otimes B = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,m} B \\ \vdots & \ddots & \vdots \\ a_{n,1} B & \cdots & a_{n,m} B \end{pmatrix}. \end{split}\]

EXAMPLE: Here is a simple illustrative example from Wikipedia:

\[\begin{split} \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \otimes \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} = \begin{bmatrix} 1 \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} & 2 \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} \\ 3 \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} & 4 \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} \\ \end{bmatrix} = \begin{bmatrix} 0 & 5 & 0 & 10 \\ 6 & 7 & 12 & 14 \\ 0 & 15 & 0 & 20 \\ 18 & 21 & 24 & 28 \end{bmatrix}. \end{split}\]

\(\lhd\)
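In NumPy, the Kronecker product is implemented as np.kron (a minimal sketch reproducing the computation above, and previewing the lack of commutativity discussed below):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 5], [6, 7]])

print(np.kron(A, B))   # matches the 4 x 4 matrix above
print(np.kron(B, A))   # a different matrix: the Kronecker product is not commutative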

EXAMPLE: (Outer product) \(\idx{outer product}\xdi\) Here is another example we have encountered previously, the outer product of two vectors \(\mathbf{u} = (u_1,\ldots,u_n) \in \mathbb{R}^n\) and \(\mathbf{v} = (v_1,\ldots, v_m) \in \mathbb{R}^m\). Recall that the outer product is defined in block form as the \(n \times m\) matrix

\[ \mathbf{u} \mathbf{v}^T = \begin{pmatrix} v_1 \mathbf{u} & \cdots & v_m \mathbf{u} \end{pmatrix} = \mathbf{v}^T \otimes \mathbf{u}. \]

Equivalently,

\[\begin{split} \mathbf{u} \mathbf{v}^T = \begin{pmatrix} u_1 \mathbf{v}^T\\ \vdots\\ u_n \mathbf{v}^T \end{pmatrix} = \mathbf{u} \otimes \mathbf{v}^T. \end{split}\]

\(\lhd\)

EXAMPLE: (continued) In the previous example the Kronecker product turned out to be commutative (i.e., we had \(\mathbf{v}^T \otimes \mathbf{u} = \mathbf{u} \otimes \mathbf{v}^T\)). This is not the case in general. Going back to the first example above, note that

\[\begin{split} \begin{bmatrix} 0 & 5 \\ 6 & 7 \\ \end{bmatrix} \otimes \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} = \begin{bmatrix} 0 \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} & 5 \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \\ 6 \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} & 7 \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \\ \end{bmatrix} = \begin{bmatrix} 0 & 0 & 5 & 10 \\ 0 & 0 & 15 & 20 \\ 6 & 12 & 7 & 14 \\ 18 & 24 & 21 & 28 \end{bmatrix}. \end{split}\]

You can check that this is different from what we obtained in the opposite order. \(\lhd\)

The proof of the following lemma is left as an exercise.

LEMMA (Properties of the Kronecker Product) \(\idx{properties of the Kronecker Product}\xdi\) The Kronecker product has the following properties:

a) If \(B, C\) are matrices of the same dimension

\[ A \otimes (B + C) = A \otimes B + A \otimes C \quad \text{and}\quad (B + C) \otimes A = B \otimes A + C \otimes A. \]

b) If \(A, B, C, D\) are matrices of such size that one can form the matrix products \(AC\) and \(BD\), then

\[ (A \otimes B)\,(C \otimes D) = (AC) \otimes (BD). \]

c) If \(A, C\) are matrices of the same dimension and \(B, D\) are matrices of the same dimension, then

\[ (A \otimes B)\odot(C \otimes D) = (A\odot C) \otimes (B\odot D). \]

d) If \(A,B\) are invertible, then

\[ (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}. \]

e) The transpose of \(A \otimes B\) is

\[ (A \otimes B)^T = A^T \otimes B^T. \]

f) If \(\mathbf{u}\) is a column vector and \(A, B\) are matrices of such size that one can form the matrix product \(AB\), then

\[ (\mathbf{u} \otimes A) B = \mathbf{u} \otimes (AB) \quad\text{and}\quad (A \otimes \mathbf{u}) B = (AB) \otimes \mathbf{u}. \]

Similarly,

\[ A \,(\mathbf{u}^T \otimes B) = \mathbf{u}^T \otimes (AB) \quad\text{and}\quad A \,(B \otimes \mathbf{u}^T) = (AB) \otimes \mathbf{u}^T. \]

\(\flat\)
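As a quick sanity check, here is a numerical verification of the mixed-product property b) and the transpose property e) with random matrices of compatible (and otherwise arbitrary) sizes (a sketch):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3)); C = rng.normal(size=(3, 4))   # AC is 2 x 4
B = rng.normal(size=(5, 2)); D = rng.normal(size=(2, 6))   # BD is 5 x 6

print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))  # property b)
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))                    # property e)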

8.2.2. Jacobian#

Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable. A different way to put this is that \(f'(x)\) is the slope of the tangent line to \(f\) at \(x\). Formally, one can approximate \(f(x)\) by a linear function in the neighborhood of \(x\) as follows

\[ f(x + h) = f(x) + f'(x) h + r(h), \]

where \(r(h)\) is negligible compared to \(h\) in the sense that

\[ \lim_{h\to 0} \frac{r(h)}{h} = 0. \]

Indeed, define

\[ r(h) = f(x + h) - f(x) - f'(x) h. \]

Then by definition of the derivative

\[ \lim_{h\to 0} \frac{r(h)}{h} = \lim_{h\to 0} \frac{f(x + h) - f(x) - f'(x) h}{h} = \lim_{h\to 0} \left[\frac{f(x + h) - f(x)}{h} - f'(x) \right] = 0. \]

For vector-valued functions, we have the following generalization. Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x} \in D\) be an interior point of \(D\). We say that \(\mathbf{f}\) is differentiable\(\idx{differentiable}\xdi\) at \(\mathbf{x}\) if there exists a matrix \(A \in \mathbb{R}^{m \times d}\) such that

\[ \mathbf{f}(\mathbf{x}+\mathbf{h}) = \mathbf{f}(\mathbf{x}) + A \mathbf{h} + \mathbf{r}(\mathbf{h}) \]

where

\[ \lim_{\mathbf{h} \to 0} \frac{\|\mathbf{r}(\mathbf{h})\|_2}{\|\mathbf{h}\|_2} = 0. \]

The matrix \(\mathbf{f}'(\mathbf{x}) = A\) is called the differential\(\idx{differential}\xdi\) of \(\mathbf{f}\) at \(\mathbf{x}\), and we see that the affine map \(\mathbf{f}(\mathbf{x}) + A \mathbf{h}\) provides an approximation of \(\mathbf{f}\) in the neighborhood of \(\mathbf{x}\).

We will not derive the full theory here. When each component of \(\mathbf{f}\) has continuous partial derivatives in a neighborhood of \(\mathbf{x}\), the differential exists and is equal to the Jacobian, defined next.

DEFINITION (Jacobian) \(\idx{Jacobian}\xdi\) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\) where \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists for all \(i, j\). The Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\) is the \(m \times d\) matrix

\[\begin{split} J_{\mathbf{f}}(\mathbf{x}_0) = \begin{pmatrix} \frac{\partial f_1 (\mathbf{x}_0)}{\partial x_1} & \ldots & \frac{\partial f_1 (\mathbf{x}_0)}{\partial x_d}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m (\mathbf{x}_0)}{\partial x_1} & \ldots & \frac{\partial f_m (\mathbf{x}_0)}{\partial x_d} \end{pmatrix}. \end{split}\]

\(\natural\)

THEOREM (Differential and Jacobian) \(\idx{differential and Jacobian theorem}\xdi\) Let \(\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume that \(\frac{\partial f_j (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i, j\). Then the differential at \(\mathbf{x}_0\) is equal to the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\). \(\sharp\)

Recall that for any \(A, B\) for which \(AB\) is well-defined it holds that \(\|A B \|_F \leq \|A\|_F \|B\|_F\). This applies in particular when \(B\) is a column vector, in which case \(\|B\|_F\) is its Euclidean norm.

Proof: By the Mean Value Theorem, for each \(i\), there is \(\xi_{\mathbf{h},i} \in (0,1)\) such that

\[ f_i(\mathbf{x}_0+\mathbf{h}) = f_i(\mathbf{x}_0) + \nabla f_i(\mathbf{x}_0 + \xi_{\mathbf{h},i} \mathbf{h})^T \mathbf{h}. \]

Define

\[\begin{split} \tilde{J}(\mathbf{h}) = \begin{pmatrix} \frac{\partial f_1 (\mathbf{x}_0 + \xi_{\mathbf{h},1} \mathbf{h})}{\partial x_1} & \ldots & \frac{\partial f_1 (\mathbf{x}_0 + \xi_{\mathbf{h},1} \mathbf{h})}{\partial x_d}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m (\mathbf{x}_0 + \xi_{\mathbf{h},m} \mathbf{h})}{\partial x_1} & \ldots & \frac{\partial f_m (\mathbf{x}_0 + \xi_{\mathbf{h},m} \mathbf{h})}{\partial x_d} \end{pmatrix}. \end{split}\]

Hence we have

\[ \mathbf{f}(\mathbf{x}_0+\mathbf{h}) - \mathbf{f}(\mathbf{x}_0) - J_{\mathbf{f}}(\mathbf{x}_0)\,\mathbf{h} = \tilde{J}(\mathbf{h}) \,\mathbf{h} - J_{\mathbf{f}}(\mathbf{x}_0)\,\mathbf{h} = \left(\tilde{J}(\mathbf{h}) - J_{\mathbf{f}}(\mathbf{x}_0)\right)\,\mathbf{h}. \]

Taking a limit as \(\mathbf{h}\) goes to \(0\), we get

\[\begin{align*} \lim_{\mathbf{h} \to 0} \frac{\|\mathbf{f}(\mathbf{x}_0+\mathbf{h}) - \mathbf{f}(\mathbf{x}_0) - J_{\mathbf{f}}(\mathbf{x}_0)\,\mathbf{h}\|_2}{\|\mathbf{h}\|_2} &= \lim_{\mathbf{h} \to 0} \frac{\|\left(\tilde{J}(\mathbf{h}) - J_{\mathbf{f}}(\mathbf{x}_0)\right)\,\mathbf{h}\|_2}{\|\mathbf{h}\|_2}\\ &\leq \lim_{\mathbf{h} \to 0} \frac{\|\tilde{J}(\mathbf{h}) - J_{\mathbf{f}}(\mathbf{x}_0)\|_F \|\mathbf{h}\|_2}{\|\mathbf{h}\|_2}\\ &= 0, \end{align*}\]

by continuity of the partial derivatives. \(\square\)

EXAMPLE: An example of a vector-valued function is

\[\begin{split} \mathbf{g}(x_1,x_2) = \begin{pmatrix} g_1(x_1,x_2)\\ g_2(x_1,x_2)\\ g_3(x_1,x_2) \end{pmatrix} = \begin{pmatrix} 3 x_1^2\\ x_2\\ x_1 x_2 \end{pmatrix}. \end{split}\]

Its Jacobian is

\[\begin{split} J_{\mathbf{g}}(x_1, x_2) = \begin{pmatrix} \frac{\partial g_1 (x_1, x_2)}{\partial x_1} & \frac{\partial g_1 (x_1, x_2)}{\partial x_2}\\ \frac{\partial g_2 (x_1, x_2)}{\partial x_1} & \frac{\partial g_2 (x_1, x_2)}{\partial x_2}\\ \frac{\partial g_3 (x_1, x_2)}{\partial x_1} & \frac{\partial g_3 (x_1, x_2)}{\partial x_2} \end{pmatrix} = \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}. \end{split}\]

\(\lhd\)
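This Jacobian can be checked with PyTorch's torch.autograd.functional.jacobian utility, which is part of the automatic differentiation machinery introduced later in this section (a minimal sketch):

import torch

def g(x):
    return torch.stack([3 * x[0]**2, x[1], x[0] * x[1]])

x0 = torch.tensor([1., 2.])
print(torch.autograd.functional.jacobian(g, x0))
# tensor([[6., 0.],
#         [0., 1.],
#         [2., 1.]])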

EXAMPLE: (Gradient and Jacobian) For a continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian reduces to the row vector

\[ J_{f}(\mathbf{x}_0) = \left(\frac{\partial f (\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f (\mathbf{x}_0)}{\partial x_d}\right)^T = \nabla f(\mathbf{x}_0)^T \]

where \(\nabla f(\mathbf{x}_0)\) is the gradient of \(f\) at \(\mathbf{x}_0\). \(\lhd\)

EXAMPLE: (Hessian and Jacobian) For a twice continuously differentiable real-valued function \(f : D \to \mathbb{R}\), the Jacobian of its gradient is

\[\begin{split} J_{\nabla f}(\mathbf{x}_0) = \begin{pmatrix} \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d \partial x_1}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1 \partial x_d} & \cdots & \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_d^2} \end{pmatrix}, \end{split}\]

that is, the Hessian of \(f\) at \(\mathbf{x}_0\). \(\lhd\)
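PyTorch exposes this special case directly as torch.autograd.functional.hessian; on a simple function of two variables (chosen here purely for illustration, not taken from the text) it agrees with the Jacobian of the gradient:

import torch

def f(x):
    return x[0]**2 * x[1] + x[1]**3   # illustrative function

x0 = torch.tensor([1., 2.])
print(torch.autograd.functional.hessian(f, x0))
# tensor([[ 4.,  2.],
#         [ 2., 12.]])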

EXAMPLE: (Parametric Curve and Jacobian) Consider the parametric curve \(\mathbf{g}(t) = (g_1(t), \ldots, g_d(t))^T \in \mathbb{R}^d\) for \(t\) in some closed interval of \(\mathbb{R}\). Assume that \(\mathbf{g}(t)\) is continuously differentiable at \(t\), that is, each of its components is.

Then

\[\begin{split} J_{\mathbf{g}}(t) = \begin{pmatrix} g_1'(t)\\ \vdots\\ g_d'(t) \end{pmatrix} = \mathbf{g}'(t). \end{split}\]

\(\lhd\)

EXAMPLE: (Affine Map) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1,\ldots,b_m)^T \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_m) : \mathbb{R}^d \to \mathbb{R}^m\) as

\[ \mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}. \]

This is an affine map. Note in particular that, in the case \(\mathbf{b} = \mathbf{0}\) of a linear map,

\[ \mathbf{f}(\mathbf{x} + \mathbf{y}) = A(\mathbf{x} + \mathbf{y}) = A\mathbf{x} + A\mathbf{y} = \mathbf{f}(\mathbf{x}) + \mathbf{f}(\mathbf{y}). \]

Denote the rows of \(A\) by \(\mathbf{a}_1^T,\ldots,\mathbf{a}_m^T\).

We compute the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}\). Note that

\[\begin{align*} \frac{\partial f_i (\mathbf{x})}{\partial x_j} &= \frac{\partial}{\partial x_j}[\mathbf{a}_i^T \mathbf{x} + b_i]\\ &= \frac{\partial}{\partial x_j}\left[\sum_{\ell=1}^d a_{i,\ell} x_{\ell} + b_i\right]\\ &= a_{i,j}. \end{align*}\]

So

\[ J_{\mathbf{f}}(\mathbf{x}) = A. \]

\(\lhd\)

The following important example is a less straightforward application of the Jacobian.

It will be useful to introduce the vectorization \(\mathrm{vec}(A) \in \mathbb{R}^{nm}\) of a matrix \(A = (a_{i,j})_{i,j} \in \mathbb{R}^{n \times m}\) as the vector

\[ \mathrm{vec}(A) = (a_{1,1},\ldots,a_{n,1},a_{1,2},\ldots,a_{n,2},\ldots,a_{1,m},\ldots,a_{n,m}). \]

That is, it is obtained by stacking the columns of \(A\) on top of each other.
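In NumPy, vectorization corresponds to flattening in column-major (Fortran) order (a minimal sketch):

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])              # 2 x 3 matrix

print(A.flatten(order='F'))            # vec(A) = [1 4 2 5 3 6]
print(A.T.flatten())                   # same result: stack the columns of A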

EXAMPLE: (Jacobian of a Linear Map with Respect to its Matrix) We take a different tack on the previous example. In data science applications, it will be useful to compute the Jacobian of the linear map \(X \mathbf{z}\) with respect to the matrix \(X \in \mathbb{R}^{n \times m}\). Specifically, for a fixed \(\mathbf{z} \in \mathbb{R}^{m}\), letting \((\mathbf{x}^{(i)})^T\) be the \(i\)-th row of \(X\), we define the function

\[\begin{split} \mathbf{f}(\mathbf{x}) = X \mathbf{z} = \begin{pmatrix} (\mathbf{x}^{(1)})^T\\ \vdots\\ (\mathbf{x}^{(n)})^T \end{pmatrix} \mathbf{z} = \begin{pmatrix} (\mathbf{x}^{(1)})^T \mathbf{z} \\ \vdots\\ (\mathbf{x}^{(n)})^T \mathbf{z} \end{pmatrix} \end{split}\]

where \(\mathbf{x} = \mathrm{vec}(X^T) = (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})\).

To compute the Jacobian, let us look at the columns that correspond to the variables in \(\mathbf{x}^{(k)}\), that is, columns \(\alpha_k = (k-1) m + 1\) to \(\beta_k = k m\). Only the \(k\)-th component of \(\mathbf{f}\) depends on \(\mathbf{x}^{(k)}\), so in those columns every row of \(J_{\mathbf{f}}(\mathbf{x})\) other than row \(k\) is \(0\).

Row \(k\), on the other hand, is \(\mathbf{z}^T\) by our previous formula for the Jacobian of an affine map. Hence one way to write the columns \(\alpha_k\) to \(\beta_k\) of \(J_{\mathbf{f}}(\mathbf{x})\) is \(\mathbf{e}_k \mathbf{z}^T\), where \(\mathbf{e}_k \in \mathbb{R}^{n}\) is the \(k\)-th standard basis vector of \(\mathbb{R}^{n}\).

So \(J_{\mathbf{f}}(\mathbf{x})\) can be written in block form as

\[ J_{\mathbf{f}}(\mathbf{x}) = \begin{pmatrix} \mathbf{e}_1 \mathbf{z}^T & \cdots & \mathbf{e}_{n}\mathbf{z}^T \end{pmatrix} = I_{n\times n} \otimes \mathbf{z}^T =: \mathbb{B}_{n}[\mathbf{z}], \]

where the last equality is a definition. \(\lhd\)
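The formula \(\mathbb{B}_{n}[\mathbf{z}] = I_{n\times n} \otimes \mathbf{z}^T\) can be verified numerically by comparing it with an automatic-differentiation Jacobian of \(X \mathbf{z}\) with respect to \(\mathrm{vec}(X^T)\) (a sketch with small, arbitrary dimensions):

import torch

n, m = 3, 2
z = torch.tensor([2., -1.])

def f(x):                  # x plays the role of vec(X^T), i.e., the rows of X stacked
    X = x.reshape(n, m)    # row i of X is the i-th block of m entries of x
    return X @ z

x0 = torch.randn(n * m)
J = torch.autograd.functional.jacobian(f, x0)
print(torch.allclose(J, torch.kron(torch.eye(n), z.reshape(1, -1))))  # True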

We will need one more wrinkle.

EXAMPLE: (Jacobian of a Linear Map with Respect to its Input and Matrix) Consider again the linear map \(X \mathbf{z}\) – this time as a function of both the matrix \(X \in \mathbb{R}^{n \times m}\) and the vector \(\mathbf{z} \in \mathbb{R}^{m}\). That is, letting again \((\mathbf{x}^{(i)})^T\) be the \(i\)-th row of \(X\), we define the function

\[\begin{split} \mathbf{g}(\mathbf{z}, \mathbf{x}) = X \mathbf{z} = \begin{pmatrix} (\mathbf{x}^{(1)})^T\\ \vdots\\ (\mathbf{x}^{(n)})^T \end{pmatrix} \mathbf{z} = \begin{pmatrix} (\mathbf{x}^{(1)})^T \mathbf{z} \\ \vdots\\ (\mathbf{x}^{(n)})^T \mathbf{z} \end{pmatrix} \end{split}\]

where as before \(\mathbf{x} = \mathrm{vec}(X^T) = (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})\).

To compute the Jacobian, we think of it as a block matrix and use the two previous examples. The columns of \(J_{\mathbf{g}}(\mathbf{z}, \mathbf{x})\) corresponding to the variables in \(\mathbf{z}\), that is, columns \(1\) to \(m\), are

\[\begin{split} X = \begin{pmatrix} (\mathbf{x}^{(1)})^T\\ \vdots\\ (\mathbf{x}^{(n)})^T \end{pmatrix} =: \mathbb{A}_{n}[\mathbf{x}]. \end{split}\]

The columns of \(J_{\mathbf{g}}(\mathbf{z}, \mathbf{x})\) corresponding to the variables in \(\mathbf{x}\), that is, columns \(m + 1\) to \(m + nm\), are the matrix \(\mathbb{B}_{n}[\mathbf{z}]\). Note that, in both \(\mathbb{A}_{n}[\mathbf{x}]\) and \(\mathbb{B}_{n}[\mathbf{z}]\), the subscript \(n\) indicates the number of rows of the matrix. The number of columns is determined by \(n\) and the size of the input vector:

  • the length of \(\mathbf{x}\) divided by \(n\) for \(\mathbb{A}_{n}[\mathbf{x}]\);

  • the length of \(\mathbf{z}\) multiplied by \(n\) for \(\mathbb{B}_{n}[\mathbf{z}]\).

So \(J_{\mathbf{g}}(\mathbf{z}, \mathbf{x})\) can be written in block form as

\[ J_{\mathbf{g}}(\mathbf{z}, \mathbf{x}) = \begin{pmatrix} \mathbb{A}_{n}[\mathbf{x}] & \mathbb{B}_{n}[\mathbf{z}] \end{pmatrix}. \]

\(\lhd\)

EXAMPLE: (Elementwise Function) Let \(f : D \to \mathbb{R}\), with \(D \subseteq \mathbb{R}\), be a continuously differentiable real-valued function of a single variable. For \(n \geq 2\), consider applying \(f\) to each entry of a vector \(\mathbf{x} \in \mathbb{R}^n\), that is, let \(\mathbf{f} : D^n \to \mathbb{R}^n\) with

\[ \mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_n(\mathbf{x})) = (f(x_1), \ldots, f(x_n)). \]

The Jacobian of \(\mathbf{f}\) can be computed from \(f'\), the derivative of the single-variable case. Indeed, letting \(\mathbf{x} = (x_1,\ldots,x_n)\) be such that \(x_i\) is an interior point of \(D\) for all \(i\),

\[ \frac{\partial f_j(\mathbf{x})}{\partial x_j} = f'(x_j), \]

while for \(\ell \neq j\)

\[ \frac{\partial f_\ell(\mathbf{x})}{\partial x_j} =0, \]

as \(f_\ell(\mathbf{x})\) does not in fact depend on \(x_j\). In other words, the \(j\)-th column of the Jacobian is \(f'(x_j) \,\mathbf{e}_j\), where again \(\mathbf{e}_{j}\) is the \(j\)-th standard basis vector in \(\mathbb{R}^{n}\).

So \(J_{\mathbf{f}}(\mathbf{x})\) is the diagonal matrix with diagonal entries \(f'(x_j)\), \(j=1, \ldots, n\), which we denote by

\[ J_{\mathbf{f}}(\mathbf{x}) = \mathrm{diag}(f'(x_1),\ldots,f'(x_n)). \]

\(\lhd\)
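For instance, applying the sine function element-wise gives a Jacobian equal to \(\mathrm{diag}(\cos(x_1),\ldots,\cos(x_n))\), which can be confirmed with automatic differentiation (a minimal sketch):

import torch

x0 = torch.tensor([0.5, 1.0, 2.0])
J = torch.autograd.functional.jacobian(torch.sin, x0)
print(torch.allclose(J, torch.diag(torch.cos(x0))))  # True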

8.2.3. Generalization of the Chain Rule#

As we have seen, functions are often obtained from the composition of simpler ones. We will use the vector notation \(\mathbf{h} = \mathbf{g} \circ \mathbf{f}\) for the function \(\mathbf{h}(\mathbf{x}) = \mathbf{g} (\mathbf{f} (\mathbf{x}))\).

LEMMA (Composition of Continuous Functions) \(\idx{composition of continuous functions lemma}\xdi\) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuous at \(\mathbf{x}_0\) and that \(\mathbf{g}\) is continuous at \(\mathbf{f}(\mathbf{x}_0)\). Then \(\mathbf{g} \circ \mathbf{f}\) is continuous at \(\mathbf{x}_0\). \(\flat\)

The Chain Rule gives a formula for the Jacobian of a composition.

THEOREM (Chain Rule) \(\idx{chain rule}\xdi\) Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Then

\[ J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \,J_{\mathbf{f}}(\mathbf{x}_0) \]

as a product of matrices. \(\sharp\)

Intuitively, the Jacobian provides a linear approximation of the function in the neighborhood of a point. The composition of linear maps corresponds to the product of the associated matrices. Similarly, the Jacobian of a composition is the product of the Jacobians.

Proof: To avoid confusion, we think of \(\mathbf{f}\) and \(\mathbf{g}\) as being functions of variables with different names, specifically,

\[ \mathbf{f}(\mathbf{x}) = (f_1(x_1,\ldots,x_d),\ldots,f_m(x_1,\ldots,x_d)) \]

and

\[ \mathbf{g}(\mathbf{y}) = (g_1(y_1,\ldots,y_m),\ldots,g_p(y_1,\ldots,y_m)). \]

We apply the Chain Rule for a real-valued function over a parametric vector curve. That is, we think of

\[ h_i(\mathbf{x}) = g_i(\mathbf{f}(\mathbf{x})) = g_i(f_1(x_1,\ldots,x_j,\ldots,x_d),\ldots,f_m(x_1,\ldots,x_j,\ldots,x_d)) \]

as a function of \(x_j\) only with all other \(x_i\)s fixed.

We get that

\[ \frac{\partial h_i(\mathbf{x}_0)}{\partial x_j} = \sum_{k=1}^m \frac{\partial g_i(\mathbf{f}(\mathbf{x}_0))} {\partial y_k} \frac{\partial f_k(\mathbf{x}_0)}{\partial x_j} \]

where, as before, the notation \(\frac{\partial g_i} {\partial y_k}\) indicates the partial derivative of \(g_i\) with respect to its \(k\)-th component. In matrix form, the claim follows. \(\square\)

EXAMPLE: (Affine Map continued) Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). Define again the vector-valued function \(\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). In addition, let \(C \in \mathbb{R}^{p \times m}\) and \(\mathbf{d} \in \mathbb{R}^{p}\), define \(\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^p\) as \(\mathbf{g}(\mathbf{y}) = C \mathbf{y} + \mathbf{d}\).

Then

\[ J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x})) \,J_{\mathbf{f}}(\mathbf{x}) = C A, \]

for all \(\mathbf{x} \in \mathbb{R}^d\).

This is consistent with the observation that

\[ \mathbf{g} \circ \mathbf{f} (\mathbf{x}) = \mathbf{g} (\mathbf{f} (\mathbf{x})) = C( A\mathbf{x} + \mathbf{b} ) + \mathbf{d} = CA \mathbf{x} + (C\mathbf{b} + \mathbf{d}). \]

\(\lhd\)
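A quick numerical check of this composition rule with random affine maps (a sketch; the dimensions are arbitrary, and d_vec stands for the offset \(\mathbf{d}\) in the text):

import torch

d, m, p = 2, 3, 4
A = torch.randn(m, d); b = torch.randn(m)
C = torch.randn(p, m); d_vec = torch.randn(p)

h = lambda x: C @ (A @ x + b) + d_vec    # h = g ∘ f
J = torch.autograd.functional.jacobian(h, torch.randn(d))
print(torch.allclose(J, C @ A))          # True: the Jacobian of the composition is C A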

EXAMPLE: Suppose we want to compute the gradient of the function

\[ f(x_1, x_2) = 3 x_1^2 + x_2 + \exp(x_1 x_2). \]

We could apply the Chain Rule directly, but to illustrate the perspective that is coming up, we think of \(f\) as a composition of “simpler” vector-valued functions. Specifically, let

\[\begin{split} \mathbf{g}(x_1,x_2) = \begin{pmatrix} 3 x_1^2\\ x_2\\ x_1 x_2 \end{pmatrix} \qquad h(y_1,y_2,y_3) = y_1 + y_2 + \exp(y_3). \end{split}\]

Then \(f(x_1, x_2) = h(\mathbf{g}(x_1, x_2)) = h \circ \mathbf{g}(x_1, x_2)\).

By the Chain Rule, we can compute the gradient of \(f\) by first computing the Jacobians of \(\mathbf{g}\) and \(h\). We have already computed the Jacobian of \(\mathbf{g}\)

\[\begin{split} J_{\mathbf{g}}(x_1, x_2) = \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}. \end{split}\]

The Jacobian of \(h\) is

\[ J_h(y_1, y_2, y_3) = \begin{pmatrix} \frac{\partial h(y_1, y_2, y_3)}{\partial y_1} & \frac{\partial h(y_1, y_2, y_3)}{\partial y_2} & \frac{\partial h(y_1, y_2, y_3)}{\partial y_3} \end{pmatrix} = \begin{pmatrix} 1 & 1 & \exp(y_3) \end{pmatrix}. \]

The Chain Rule stipulates that

\[\begin{align*} \nabla f(x_1, x_2)^T &= J_f(x_1, x_2)\\ &= J_h(\mathbf{g}(x_1,x_2)) \, J_{\mathbf{g}}(x_1, x_2)\\ &= \begin{pmatrix} 1 & 1 & \exp(g_3(x_1, x_2)) \end{pmatrix} \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}\\ &= \begin{pmatrix} 1 & 1 & \exp(x_1 x_2) \end{pmatrix} \begin{pmatrix} 6 x_1 & 0\\ 0 & 1\\ x_2 & x_1 \end{pmatrix}\\ &= \begin{pmatrix} 6 x_1 + x_2 \exp(x_1 x_2) & 1 + x_1 \exp(x_1 x_2) \end{pmatrix}. \end{align*}\]

You can check directly (i.e., without the composition) that this is indeed the correct gradient (transposed).

Alternatively, it is instructive to “expand” the Chain Rule as we did in its proof. Specifically,

\[\begin{align*} \frac{\partial f (x_1, x_2)}{\partial x_1} &= \sum_{i=1}^3 \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_i} \frac{\partial g_i (x_1, x_2)}{\partial x_1}\\ &= \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_1} \frac{\partial g_1 (x_1, x_2)}{\partial x_1} + \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_2} \frac{\partial g_2 (x_1, x_2)}{\partial x_1} + \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_3} \frac{\partial g_3 (x_1, x_2)}{\partial x_1}\\ &= 1 \cdot 6x_1 + 1 \cdot 0 + \exp(g_3(x_1, x_2)) \cdot x_2\\ &= 6 x_1 + x_2 \exp(x_1 x_2). \end{align*}\]

Note that this corresponds to multiplying \(J_h(\mathbf{g}(x_1,x_2))\) by the first column of \(J_{\mathbf{g}}(x_1, x_2)\).

Similarly

\[\begin{align*} \frac{\partial f (x_1, x_2)}{\partial x_2} &= \sum_{i=1}^3 \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_i} \frac{\partial g_i (x_1, x_2)}{\partial x_2}\\ &= \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_1} \frac{\partial g_1 (x_1, x_2)}{\partial x_2} + \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_2} \frac{\partial g_2 (x_1, x_2)}{\partial x_2} + \frac{\partial h(\mathbf{g}(x_1,x_2))}{\partial y_3} \frac{\partial g_3 (x_1, x_2)}{\partial x_2}\\ &= 1 \cdot 0 + 1 \cdot 1 + \exp(g_3(x_1, x_2)) \cdot x_1\\ &= 1 + x_1 \exp(x_1 x_2). \end{align*}\]

This corresponds to multiplying \(J_h(\mathbf{g}(x_1,x_2))\) by the second column of \(J_{\mathbf{g}}(x_1, x_2)\). \(\lhd\)
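The same computation can be organized with PyTorch's Jacobian utility, multiplying \(J_h(\mathbf{g}(x_1,x_2))\) by \(J_{\mathbf{g}}(x_1, x_2)\) and comparing with the Jacobian of the composition (a minimal sketch, previewing the next subsection):

import torch
from torch.autograd.functional import jacobian

def g(x):
    return torch.stack([3 * x[0]**2, x[1], x[0] * x[1]])

def h(y):
    return y[0] + y[1] + torch.exp(y[2])

x0 = torch.tensor([1., 2.])
Jg = jacobian(g, x0)                       # 3 x 2 Jacobian of g
Jh = jacobian(h, g(x0))                    # the row J_h(g(x0)), returned as a length-3 tensor
print(Jh @ Jg)                             # chain rule product: tensor([20.7781,  8.3891])
print(jacobian(lambda x: h(g(x)), x0))     # Jacobian of the composition: same values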

CHAT & LEARN The Jacobian determinant has important applications in change of variables for multivariable integrals. Ask your favorite AI chatbot to explain this application and provide an example of using the Jacobian determinant in a change of variables for a double integral. \(\ddagger\)

8.2.4. Brief introduction to automatic differentiation in PyTorch#

We illustrate the use of automatic differentiation\(\idx{automatic differentiation}\xdi\) to compute gradients in PyTorch.

Quoting Wikipedia:

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.

Automatic differentiation in PyTorch We will use PyTorch. It uses tensors\(\idx{tensor}\xdi\), which in many ways behave similarly to Numpy arrays. See here for a quick introduction. We first initialize the tensors. Here each tensor corresponds to a single real variable. With the option requires_grad=True, we indicate that these are variables with respect to which a gradient will be taken later. We initialize the tensors at the values where the derivatives will be computed. If derivatives need to be computed at different values, we need to repeat this process. The function .backward() computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with .grad.

NUMERICAL CORNER: This is better understood through an example.

import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

We define the function. Note that we use torch.exp, the PyTorch implementation of the (element-wise) exponential function. Moreover, as in NumPy, PyTorch allows the use of ** for taking a power. Here is a list of operations on tensors in PyTorch.

f = 3 * (x1 ** 2) + x2 + torch.exp(x1 * x2)

f.backward()

print(x1.grad)  # df/dx1
print(x2.grad)  # df/dx2
tensor(20.7781)
tensor(8.3891)

The input parameters can also be vectors, which makes it possible to work with functions of many variables. Here we use torch.sum to sum the entries.

z = torch.tensor([1., 2., 3.], requires_grad=True)

g = torch.sum(z ** 2)
g.backward()

print(z.grad)  # gradient is (2 z_1, 2 z_2, 2 z_3)
tensor([2., 4., 6.])

Here is another typical example in a data science context.

X = torch.randn(3, 2)  # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]])  # Dataset (labels), as a column vector matching the shape of the predictions
theta = torch.ones(2, 1, requires_grad=True)  # Parameter assignment

predict = X @ theta  # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2)  # Loss function
loss.backward()  # Compute gradients

print(theta.grad)  # gradient of loss
tensor([[ 3.6615],
        [27.0129]])

CHAT & LEARN Ask your favorite AI chatbot to explain how to compute a second derivative using PyTorch (it’s a bit tricky). Ask for code that you can apply to the previous examples. (Open In Colab) \(\ddagger\)

\(\unlhd\)

Implementing gradient descent in PyTorch Rather than explicitly specifying the gradient function, we could use PyTorch to compute it automatically. This is done next. Note that the descent update is done within a with torch.no_grad(): block, which ensures that the update operation itself is not tracked for gradient computation. The input x0 is converted to a tensor, while the output xk.numpy(force=True) is a Numpy array. The function torch.Tensor.numpy() converts a PyTorch tensor to a Numpy array (see the documentation for an explanation of the force=True option). Also, quoting ChatGPT:

In the given code, .item() is used to extract the scalar value from a tensor. In PyTorch, when you perform operations on tensors, you get back tensors as results, even if the result is a single scalar value. .item() is used to extract this scalar value from the tensor.

def gd_with_ad(f, x0, alpha=1e-3, niters=int(1e6)):
    xk = torch.tensor(x0, requires_grad=True, dtype=torch.float)
    
    for _ in range(niters):
        value = f(xk)
        value.backward()  # compute the gradient of f at the current iterate

        with torch.no_grad():  
            xk -= alpha * xk.grad  # descent step, excluded from gradient tracking

        xk.grad.zero_()  # reset the accumulated gradient before the next iteration

    return xk.numpy(force=True), f(xk).item()

NUMERICAL CORNER: We revisit a previous example.

def f(x):
    return x**3

print(gd_with_ad(f, 2, niters=int(1e4)))
(array(0.03277362, dtype=float32), 3.5202472645323724e-05)
print(gd_with_ad(f, -2, niters=100))
(array(-4.9335055, dtype=float32), -120.07894897460938)

\(\unlhd\)

CHAT & LEARN The section briefly mentions that automatic differentiation is distinct from symbolic differentiation and numerical differentiation. Ask your favorite AI chatbot to explain in more detail the differences between these three methods of computing derivatives. \(\ddagger\)

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 Let \(A \in \mathbb{R}^{n \times m}\) and \(B \in \mathbb{R}^{p \times q}\). What are the dimensions of the Kronecker product \(A \otimes B\)?

a) \(n \times m\)

b) \(p \times q\)

c) \(np \times mq\)

d) \(nq \times mp\)

2 If \(f: \mathbb{R}^d \rightarrow \mathbb{R}^m\) is a continuously differentiable function, what is its Jacobian \(J_f(x_0)\) at an interior point \(x_0\) of its domain?

a) A scalar representing the rate of change of \(f\) at \(x_0\).

b) A vector in \(\mathbb{R}^m\) representing the direction of steepest ascent of \(f\) at \(x_0\).

c) An \(m \times d\) matrix of partial derivatives of the component functions of \(f\) at \(x_0\).

d) The Hessian matrix of \(f\) at \(x_0\).

3 In the context of the Chain Rule, if \(f: \mathbb{R}^2 \to \mathbb{R}^3\) and \(g: \mathbb{R}^3 \to \mathbb{R}\), what is the dimension of the Jacobian matrix \(J_{g \circ f}(x)\)?

a) \(3 \times 2\)

b) \(1 \times 3\)

c) \(2 \times 3\)

d) \(1 \times 2\)

4 Let \(\mathbf{f} : D_1 \to \mathbb{R}^m\), where \(D_1 \subseteq \mathbb{R}^d\), and let \(\mathbf{g} : D_2 \to \mathbb{R}^p\), where \(D_2 \subseteq \mathbb{R}^m\). Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Which of the following is correct according to the Chain Rule?

a) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{f}}(\mathbf{x}_0) \, J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0))\)

b) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \, J_{\mathbf{f}}(\mathbf{x}_0)\)

c) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{x}_0)) \, J_{\mathbf{g}}(\mathbf{x}_0)\)

d) \(J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{x}_0) \, J_{\mathbf{f}}(\mathbf{g}(\mathbf{x}_0))\)

5 Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). What is the Jacobian of \(\mathbf{f}\) at \(\mathbf{x}_0\)?

a) \(J_{\mathbf{f}}(\mathbf{x}_0) = A^T\)

b) \(J_{\mathbf{f}}(\mathbf{x}_0) = A \mathbf{x}_0 + \mathbf{b}\)

c) \(J_{\mathbf{f}}(\mathbf{x}_0) = A\)

d) \(J_{\mathbf{f}}(\mathbf{x}_0) = \mathbf{b}\)

Answer for 1: c. Justification: The text defines the Kronecker product as a matrix in block form with dimensions \(np \times mq\).

Answer for 2: c. Justification: The text defines the Jacobian of a vector-valued function as a matrix of partial derivatives.

Answer for 3: d. Justification: The composition \(g \circ f\) maps \(\mathbb{R}^2 \to \mathbb{R}\), hence the Jacobian matrix \(J_{g \circ f}(x)\) is \(1 \times 2\).

Answer for 4: b. Justification: The text states: “The Chain Rule gives a formula for the Jacobian of a composition. […] Assume that \(\mathbf{f}\) is continuously differentiable at \(\mathbf{x}_0\), an interior point of \(D_1\), and that \(\mathbf{g}\) is continuously differentiable at \(\mathbf{f}(\mathbf{x}_0)\), an interior point of \(D_2\). Then

\[ J_{\mathbf{g} \circ \mathbf{f}}(\mathbf{x}_0) = J_{\mathbf{g}}(\mathbf{f}(\mathbf{x}_0)) \,J_{\mathbf{f}}(\mathbf{x}_0) \]

as a product of matrices.”

Answer for 5: c. Justification: The text states: “Let \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1,\ldots,b_m)^T \in \mathbb{R}^{m}\). Define the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_m) : \mathbb{R}^d \to \mathbb{R}^m\) as \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). […] So \(J_{\mathbf{f}}(\mathbf{x}) = A.\)”