\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

8.6. Exercises

8.6.1. Warm-up worksheets

(with help from Claude, Gemini, and ChatGPT)

Section 8.2

E8.2.1 Let \(A = \begin{pmatrix} 2 & 1 \\ 0 & -1 \end{pmatrix}\). Compute the vectorization \(\text{vec}(A)\).

E8.2.2 Let \(a = (2, -1, 3)\) and \(b = (1, 0, -2)\). Compute the Hadamard product \(a \odot b\).

E8.2.3 Let \(A = \begin{pmatrix} 1 & 2 \\ -1 & 0 \end{pmatrix}\) and \(B = \begin{pmatrix} 3 & -1 \\ 2 & 1 \end{pmatrix}\). Compute the Kronecker product \(A \otimes B\).

E8.2.4 Let \(A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\) and \(B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}\). Compute the Hadamard product \(A \odot B\).

E8.2.5 Let \(A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\) and \(B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}\). Compute the Kronecker product \(A \otimes B\).

E8.2.6 Let \(\mathbf{f}(x, y) = (x^2y, \sin(xy), e^{x+y})\). Compute the Jacobian matrix of \(\mathbf{f}\) at the point \((1, \frac{\pi}{2})\).

E8.2.7 Let \(\mathbf{f}(x, y) = (x^2 + y^2, xy)\) and \(\mathbf{g}(u, v) = (uv, u + v)\). Compute the Jacobian matrix of the composition \(\mathbf{g} \circ \mathbf{f}\) at the point \((1, 2)\).
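
If you want to verify a hand-computed Jacobian numerically, the sketch below (which assumes PyTorch is available; the exercise itself is meant to be done by hand) applies `torch.autograd.functional.jacobian` to the composition in E8.2.7.

```python
import torch
from torch.autograd.functional import jacobian

# Optional numerical check for E8.2.7.
def f(v):
    x, y = v
    return torch.stack([x**2 + y**2, x * y])

def g(v):
    u, w = v
    return torch.stack([u * w, u + w])

point = torch.tensor([1.0, 2.0])
print(jacobian(lambda v: g(f(v)), point))  # Jacobian of g o f at (1, 2)
```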

E8.2.8 Let \(\mathbf{a} = (1, 2, 3)\), \(\mathbf{b} = (4, 5, 6)\), and \(\mathbf{c} = (7, 8, 9)\). Compute \(\mathbf{a}^T(\mathbf{b} \odot \mathbf{c})\) and \(\mathbf{1}^T(\mathbf{a} \odot \mathbf{b} \odot \mathbf{c})\) and verify that they are equal.

E8.2.9 Let \(A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\) and \(B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}\). Compute \((A \otimes B)^T\) and \(A^T \otimes B^T\) and verify that they are equal.
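
A quick way to spot-check the computations in E8.2.4, E8.2.5, E8.2.8, and E8.2.9 is with NumPy, where `*` is the entrywise (Hadamard) product and `np.kron` the Kronecker product. This is only a numerical sanity check, not a substitute for the hand computation.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
a, b, c = np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])

print(A * B)                                  # Hadamard product (E8.2.4)
print(np.kron(A, B))                          # Kronecker product (E8.2.5)
print(a @ (b * c), np.ones(3) @ (a * b * c))  # the two quantities in E8.2.8
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))  # identity in E8.2.9
```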

E8.2.10 Let \(g(x, y) = (x^2y, x + y)\). Compute the Jacobian matrix \(J_g(x, y)\).

E8.2.11 Let \(f(x, y, z) = x^2 + y^2 + z^2\). Compute the gradient of \(f\) at the point \((1, 2, 3)\).

E8.2.12 Let \(f(x) = 2x^3 - x\) and \(g(y) = y^2 + 1\). Compute the Jacobian of the composite function \(g \circ f\) at \(x = 1\).

E8.2.13 Let \(f(x, y) = xy\) and \(\mathbf{g}(x, y) = (x^2, y^2)\). Compute the Jacobian matrix of \(f \circ \mathbf{g}\) at the point \((1, 2)\).

E8.2.14 Let \(A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\) and \(\mathbf{b} = \begin{pmatrix} 5 \\ 6 \end{pmatrix}\). Define the function \(\mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}\). Compute the Jacobian matrix of \(\mathbf{f}\) at any point \(\mathbf{x} \in \mathbb{R}^2\).

E8.2.15 Let \(f(x) = \sin(x)\). Define the function \(\mathbf{g}(x, y, z) = (f(x), f(y), f(z))\). Compute the Jacobian matrix of \(\mathbf{g}\) at the point \((\frac{\pi}{2}, \frac{\pi}{4}, \frac{\pi}{6})\).

E8.2.16 Use PyTorch to find the gradient of \(f(x) = x^3 - 4x\) at \(x = 2\). Provide the PyTorch code and the result.
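
The general autograd pattern needed for E8.2.16 looks as follows; it is shown on a different function, \(x^2\) at \(x = 3\), so as not to give away the answer.

```python
import torch

# Autograd pattern: mark the input as requiring gradients, evaluate the
# function, call backward(), then read the gradient from x.grad.
x = torch.tensor(3.0, requires_grad=True)
f = x**2
f.backward()
print(x.grad)  # tensor(6.)
```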

Section 8.3

E8.3.1 Let \(A = \begin{pmatrix} 1 & -1 \\ 0 & 2 \end{pmatrix}\) and \(B = \begin{pmatrix} 2 & 1 \\ -1 & 3 \end{pmatrix}\). How many elementary operations (additions and multiplications) does it take to compute \(AB\)?

E8.3.2 Let \(A = \begin{pmatrix} 1 & -1 \\ 0 & 2 \end{pmatrix}\) and \(v = (2, -1)\). How many elementary operations (additions and multiplications) does it take to compute \(Av\)?

E8.3.3 Let \(\ell(\hat{\mathbf{y}}) = \|\hat{\mathbf{y}}\|^2\) where \(\hat{\mathbf{y}} = (\hat{y}_1, \hat{y}_2)\). Compute \(J_{\ell}(\hat{\mathbf{y}})\).

E8.3.4 Let \(\mathbf{g}_0(\mathbf{z}_0) = \begin{pmatrix} 1 & 2 \\ -1 & 0 \end{pmatrix} \mathbf{z}_0\) and \(\ell(\hat{\mathbf{y}}) = \|\hat{\mathbf{y}}\|^2\). Compute \(\nabla f(\mathbf{x})\) where \(f(\mathbf{x}) = \ell(\mathbf{g}_0(\mathbf{x}))\) and \(\mathbf{x} = (1, -1)\).

E8.3.5 Let \(\mathbf{g}_0\) and \(\ell\) be as in E8.3.4. Let \(\mathbf{g}_1(\mathbf{z}_1) = \begin{pmatrix} -1 & 0 \\ 1 & 1 \end{pmatrix} \mathbf{z}_1\). Compute \(\nabla f(\mathbf{x})\) where \(f(\mathbf{x}) = \ell(\mathbf{g}_1(\mathbf{g}_0(\mathbf{x})))\) and \(\mathbf{x} = (1, -1)\) using the reverse mode.
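
E8.3.4 and E8.3.5 are intended as hand computations via the reverse mode, but the result can be compared against automatic differentiation. The sketch below uses the matrices and squared-norm loss from those exercises; PyTorch itself is the only assumption here.

```python
import torch

A0 = torch.tensor([[1.0, 2.0], [-1.0, 0.0]])   # matrix of g_0
A1 = torch.tensor([[-1.0, 0.0], [1.0, 1.0]])   # matrix of g_1
x = torch.tensor([1.0, -1.0], requires_grad=True)

f = (A1 @ (A0 @ x)).pow(2).sum()  # ell(g_1(g_0(x))) with ell = squared norm
f.backward()
print(x.grad)  # should match the hand-derived gradient in E8.3.5
```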

E8.3.6 Let \(\mathbf{g}_0(\mathbf{x}, \mathbf{w}_0) = W_0 \mathbf{x}\) where \(W_0 = \begin{pmatrix} w_0 & w_1 \\ w_2 & w_3 \end{pmatrix}\) and \(\mathbf{w}_0 = (w_0, w_1, w_2, w_3)\). Let \(\mathbf{x} = (-1, 1)\) be fixed. Compute the Jacobian \(J_{\mathbf{g}_0}(\mathbf{x}, \mathbf{w}_0)\) with respect to \(\mathbf{w}_0\) only by directly computing the necessary partial derivatives (i.e., without using the formulas in the text), and then compare to the formulas in the text.

E8.3.7 Let \(g_1(\mathbf{z}_1, \mathbf{w}_1) = W_1 \mathbf{z}_1\) where \(W_1 = \begin{pmatrix} w_4 & w_5\end{pmatrix}\) and \(\mathbf{w}_1 = (w_4, w_5)\). Compute \(J_{g_1}(\mathbf{z}_1, \mathbf{w}_1)\) with respect to both \(\mathbf{z}_1\) and \(\mathbf{w}_1\) by directly computing the necessary partial derivatives (i.e., without using the formulas in the text), and then compare to the formulas in the text.

E8.3.8 Let \(h(\mathbf{w}) = g_1(\mathbf{g}_0(\mathbf{x}, \mathbf{w}_0), \mathbf{w}_1)\) where \(\mathbf{g}_0\) and \(g_1\) are as in E8.3.6 and E8.3.7 and \(\mathbf{w} = (\mathbf{w}_0, \mathbf{w}_1) = (w_0, w_1, w_2, w_3, w_4, w_5)\). Let \(\mathbf{x} = (-1, 1)\) be fixed. Compute \(J_h(\mathbf{w})\) by directly computing the necessary partial derivatives (i.e., without using the formulas in the text), and then compare to the formulas in the text.

E8.3.9 Let \(f(\mathbf{w}) = \ell(g_1(\mathbf{g}_0(\mathbf{x}, \mathbf{w}_0), \mathbf{w}_1))\) where \(\ell(\hat{y}) = \hat{y}^2\), \(\mathbf{g}_0\) and \(g_1\) are as in E8.3.6 and E8.3.7, and \(\mathbf{w} = (\mathbf{w}_0, \mathbf{w}_1) = (w_0, w_1, w_2, w_3, w_4, w_5)\). Let \(\mathbf{x} = (-1, 1)\) be fixed. Compute \(J_f(\mathbf{w})\) by directly computing the necessary partial derivatives (i.e., without using the formulas in the text), and then compare to the formulas in the text.

Section 8.4

E8.4.1 Given a dataset with 5 samples, compute the full gradient descent step and the expected SGD step with a batch size of 2. Assume that the individual sample gradients are: \(\nabla f_{x_1, y_1}(w) = (1, 2)\), \(\nabla f_{x_2, y_2}(w) = (-1, 1)\), \(\nabla f_{x_3, y_3}(w) = (0, -1)\), \(\nabla f_{x_4, y_4}(w) = (2, 0)\), and \(\nabla f_{x_5, y_5}(w) = (1, 1)\).

E8.4.2 Compute the softmax function \(\gamma(z)\) for \(z = (1, -2, 0, 3)\).

E8.4.3 Compute the Kullback-Leibler divergence between the probability distributions \(p = (0.2, 0.3, 0.5)\) and \(q = (0.1, 0.4, 0.5)\).
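
The softmax and Kullback-Leibler computations in E8.4.2 and E8.4.3 can be spot-checked with a short NumPy sketch (natural logarithms are assumed, as elsewhere in the chapter).

```python
import numpy as np

z = np.array([1.0, -2.0, 0.0, 3.0])
print(np.exp(z) / np.exp(z).sum())   # softmax gamma(z) for E8.4.2

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.1, 0.4, 0.5])
print(np.sum(p * np.log(p / q)))     # KL(p || q) for E8.4.3
```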

E8.4.4 In linear regression with a single feature, the loss function for a single sample \((x, y)\) is given by

\[ \ell(w, b; x, y) = (wx + b - y)^2. \]

Compute the gradients \(\frac{\partial \ell}{\partial w}\) and \(\frac{\partial \ell}{\partial b}\).

E8.4.5 Suppose we have a dataset with three samples: \((x_1, y_1) = (2, 3)\), \((x_2, y_2) = (-1, 0)\), and \((x_3, y_3) = (1, 2)\). We want to perform mini-batch SGD for linear regression with a batch size of 2. If the first mini-batch randomly selected is \(B = \{1, 3\}\), compute the SGD update for the parameters \(w\) and \(b\), assuming a learning rate of \(\alpha = 0.1\). The model is initialized at \(w = 1\) and \(b = 0\).
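
A sketch of the mini-batch update in E8.4.5, assuming the per-sample loss \((wx + b - y)^2\) from E8.4.4 and a gradient averaged over the batch; conventions such as averaging versus summing should follow the text.

```python
import numpy as np

X = np.array([2.0, -1.0, 1.0])
Y = np.array([3.0, 0.0, 2.0])
w, b, alpha = 1.0, 0.0, 0.1
batch = [0, 2]  # samples 1 and 3, zero-indexed

# Per-sample gradients of (w*x + b - y)^2, averaged over the mini-batch.
grad_w = np.mean([2 * (w * X[i] + b - Y[i]) * X[i] for i in batch])
grad_b = np.mean([2 * (w * X[i] + b - Y[i]) for i in batch])
w, b = w - alpha * grad_w, b - alpha * grad_b
print(w, b)
```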

E8.4.6 For the linear regression problem with a single sample \((x, y) = ((1, 2), 3)\), compute the gradient of the loss function \(f(w) = (x^T w - y)^2\) at \(w = (0, 0)\).

E8.4.7 Consider the logistic regression loss function for a single sample \((x, y)\) where \(x \in \mathbb{R}\) and \(y \in \{0, 1\}\):

\[ \ell(w; x, y) = -y \log(\sigma(wx)) - (1-y) \log(1 - \sigma(wx)), \]

where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function. Compute the gradient \(\nabla \ell(w; x, y)\) with respect to \(w\).

E8.4.8 Consider a multinomial logistic regression problem with three classes (\(K = 3\)). Given an input vector \(x = (1, -1)\), and a weight matrix

\[\begin{split} W = \begin{pmatrix} 1 & 2 \\ -1 & 0 \\ 0 & 1 \end{pmatrix}, \end{split}\]

compute the softmax output \(\gamma(Wx)\).

E8.4.9 For the multinomial logistic regression problem with a single sample \((x, y) = ((1, 2), (0, 0, 1))\) and \(K = 3\) classes, compute the gradient of the loss function \(f(w) = -\sum_{i=1}^K y_i \log \gamma_i(Wx)\) at \(W = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}\).

E8.4.10 For the linear regression problem with two samples \((x_1, y_1) = ((1, 2), 3)\) and \((x_2, y_2) = ((4, -1), 2)\), compute the full gradient at \(w = (0, 0)\).

E8.4.11 For the multinomial logistic regression problem with two samples \((x_1, y_1) = ((1, 2), (0, 0, 1))\) and \((x_2, y_2) = ((4, -1), (1, 0, 0))\), and \(K = 3\) classes, compute the full gradient at \(W = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}\).

E8.4.12 In a binary classification problem, the logistic regression model predicts probabilities of 0.8 and 0.3 for two samples. If the true labels for these samples are 1 and 0, respectively, compute the average cross-entropy loss.

E8.4.13 In a multi-class classification problem with four classes, a model predicts the following probability distribution for a sample: \((0.1, 0.2, 0.3, 0.4)\). If the true label is the third class, what is the cross-entropy loss for this sample?
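
The two cross-entropy values in E8.4.12 and E8.4.13 reduce to negative log probabilities assigned to the true labels; a minimal check (natural logarithms assumed):

```python
import numpy as np

# E8.4.12: binary cross-entropy, true labels 1 and 0.
p1, p2 = 0.8, 0.3                      # predicted P(y = 1) for the two samples
print(np.mean([-np.log(p1), -np.log(1 - p2)]))

# E8.4.13: multi-class cross-entropy, true label is the third class.
probs = np.array([0.1, 0.2, 0.3, 0.4])
print(-np.log(probs[2]))
```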

Section 8.5

E8.5.1 Compute the sigmoid function \(\sigma(t)\) for the following values of \(t\): \(1, -1, 2\).

E8.5.2 Compute the derivative of the sigmoid function \(\sigma'(t)\) for the following values of \(t\): \(1, -1, 2\).

E8.5.3 Given the vector \(\mathbf{z} = (1, -1, 2)\), compute \(\bsigma(\mathbf{z})\) and \(\bsigma'(\mathbf{z})\).

E8.5.4 Given the matrix \(W = \begin{pmatrix} 1 & -1 \\ 2 & 0 \end{pmatrix}\) and the vector \(\mathbf{x} = (1, 2)\), compute \(\bsigma(W\mathbf{x})\).
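
Numerical values for E8.5.1 through E8.5.4 can be spot-checked as follows (a NumPy sketch; the identity \(\sigma'(t) = \sigma(t)(1 - \sigma(t))\) is used for the derivative).

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

z = np.array([1.0, -1.0, 2.0])
print(sigmoid(z))                      # E8.5.1 and E8.5.3: sigma entrywise
print(sigmoid(z) * (1 - sigmoid(z)))   # E8.5.2 and E8.5.3: sigma'(t)

W = np.array([[1.0, -1.0], [2.0, 0.0]])
x = np.array([1.0, 2.0])
print(sigmoid(W @ x))                  # E8.5.4
```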

E8.5.5 Given the matrix \(W = \begin{pmatrix} 1 & -1 \\ 2 & 0 \end{pmatrix}\) and the vector \(\mathbf{x} = (1, 2)\), compute the Jacobian matrix of \(\bsigma(W\mathbf{x})\).

E8.5.6 Given the vectors \(\mathbf{y} = (0, 1)\) and \(\mathbf{z} = (0.3, 0.7)\), compute the cross-entropy loss \(H(\mathbf{y}, \mathbf{z})\).

E8.5.7 Given the vectors \(\mathbf{y} = (0, 1)\) and \(\mathbf{z} = (0.3, 0.7)\), compute the gradient of the cross-entropy loss \(\nabla H(\mathbf{y}, \mathbf{z})\).

E8.5.8 Given the vectors \(\mathbf{w} = (1, 2, -1, 0)\) and \(\mathbf{z} = (1, 2)\), compute \(\mathbb{A}_2[\mathbf{w}]\) and \(\mathbb{B}_2[\mathbf{z}]\).

8.6.2. Problems

8.1 Consider the affine map

\[ \mathbf{f}(\mathbf{x}) = A \mathbf{x} + \mathbf{b}, \]

where \(A \in \mathbb{R}^{m \times d}\) and \(\mathbf{b} = (b_1, \ldots, b_m)^T \in \mathbb{R}^m\). Let \(S \subseteq \mathbb{R}^m\) be a convex set. Show that the following set is convex:

\[ T = \left\{ \mathbf{x} \in \mathbb{R}^d \,:\, \mathbf{f}(\mathbf{x}) \in S \right\}. \]

\(\lhd\)

8.2 (Adapted from [Khu]) Consider the vector-valued function \(\mathbf{f} = (f_1, \ldots, f_d) : \mathbb{R}^d \to \mathbb{R}^d\) defined as

\[ f_i(\mathbf{x}) = x_i^3, \]

for all \(\mathbf{x} \in \mathbb{R}^d\) and all \(i = 1, \ldots, d\).

a) Compute the Jacobian \(\mathbf{J}_\mathbf{f}(\mathbf{x})\) for all \(\mathbf{x}\).

b) When is \(\mathbf{J}_\mathbf{f}(\mathbf{x})\) invertible?

c) When is \(\mathbf{J}_\mathbf{f}(\mathbf{x})\) positive semidefinite? \(\lhd\)

8.3 Let \(A = (a_{i,j})_{i,j=1}^n \in \mathbb{R}^{n \times n}\) be a symmetric matrix.

a) Let \(\mathbf{v} = (v_1, \ldots, v_n)^T \in \mathbb{R}^n\) be an eigenvector of \(A\) with eigenvalue \(\lambda\). Let \(v_{i}\) be the largest element of \(\mathbf{v}\) in absolute value, that is, \(i \in \arg\max_j |v_j|\). Define the vector \(\mathbf{w} = (w_1, \ldots, w_n)^T\) as

\[ w_j = \frac{v_j}{v_{i}}, \qquad j=1, \ldots, n. \]

Show that

\[ \lambda - a_{i,i} = \sum_{j \neq i} a_{i,j} w_j. \]

b) Use (a) to show that, for any eigenvalue \(\lambda\) of \(A\), there is \(i\) such that

\[ |\lambda - a_{i,i}| \leq \sum_{j \neq i} |a_{i,j}|. \]

\(\lhd\)

8.4 A symmetric matrix \(A = (a_{i,j})_{i,j=1}^n \in \mathbb{R}^{n \times n}\) with nonnegative elements on its diagonal is said to be diagonally dominant if for all \(i\)

\[ a_{i,i} \geq \sum_{j \neq i} |a_{i,j}|, \]

that is, each diagonal element is greater than or equal to the sum of the absolute values of the other elements in its row. Use Problem 8.3 to prove that such a matrix is positive semidefinite. \(\lhd\)

8.5 Consider multinomial logistic regression. Let

\[ R = I_{K \times K} \otimes \mathbf{x}^T, \]

and

\[ S = \mathrm{diag}\left( \bgamma \left( \mathbf{v} \right) \right) - \bgamma \left( \mathbf{v} \right) \, \bgamma \left( \mathbf{v} \right)^T \]

where

\[ \mathbf{v} = \bfg_0 (\mathbf{x}, \mathbf{w}). \]

a) Show that

\[ \nabla f(\mathbf{w}) = \Gamma \left( \bgamma \left( \bfg_0 (\mathbf{x}, \mathbf{w}) \right) \right) \]

where

\[ \Gamma (\mathbf{u}) = R^T (\mathbf{u} - \mathbf{y}). \]

b) Use the Chain Rule to show that

\[ H_f (\mathbf{w}) = R^T S R. \]

c) Use (b) and the Properties of the Kronecker Product to show that

\[ H_f (\mathbf{w}) = \left( \mathrm{diag} \left( \bgamma \left( \mathcal{W} \mathbf{x} \right) \right) - \bgamma \left( \mathcal{W} \mathbf{x} \right) \, \bgamma \left( \mathcal{W} \mathbf{x} \right)^T \right) \otimes \mathbf{x} \mathbf{x}^T. \]

\(\lhd\)

8.6 Consider multinomial logistic regression. Use Problems 8.4 and 8.5 to show that the objective function is convex. [Hint: It is enough to show that \(S\) (defined in Problem 8.5) is diagonally dominant. Why?] \(\lhd\)

8.7 Prove part (a) of the Kronecker Product Properties Lemma. \(\lhd\)

8.8 Prove part (b) of the Kronecker Product Properties Lemma. \(\lhd\)

8.9 Prove parts (c) and (d) of the Kronecker Product Properties Lemma. \(\lhd\)

8.10 Prove part (e) of the Kronecker Product Properties Lemma. \(\lhd\)

8.11 Let \(A\) and \(B\) be symmetric matrices of size \(n \times n\) and \(m \times m\) respectively.

a) Show that \(A \otimes B\) is symmetric. [Hint: Use Problem 8.10.]

b) Compute the eigenvectors and eigenvalues of \(A \otimes B\) in terms of the eigenvectors and eigenvalues of \(A\) and \(B\). [Hint: Try the Kronecker products of eigenvectors of \(A\) and \(B\).]

c) Recall that the determinant of a symmetric matrix is the product of its eigenvalues. Show that

\[ \mathrm{det}(A \otimes B) = \mathrm{det}(A)^m \,\mathrm{det}(B)^n. \]

\(\lhd\)

8.12 Compute \(\mathrm{tr}(A \otimes B)\) in terms of \(\mathrm{tr}(A)\) and \(\mathrm{tr}(B)\). Justify your answer. \(\lhd\)
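
For Problem 8.12, a numerical experiment can suggest the formula before you prove it (a NumPy sketch; the random matrices are an arbitrary choice and prove nothing on their own).

```python
import numpy as np

# Compare tr(A kron B) with tr(A) * tr(B) on random matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))
print(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B))
```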

8.13 (a) Show that if \(D_1\) and \(D_2\) are square diagonal matrices, then so is \(D_1 \otimes D_2\).

(b) Show that if \(Q_1\) and \(Q_2\) have orthonormal columns, so does \(Q_1 \otimes Q_2\). \(\lhd\)

8.14 Let \(A_1 = U_1 \Sigma_1 V_1^T\) and \(A_2 = U_2 \Sigma_2 V_2^T\) be full SVDs of \(A_1, A_2 \in \mathbb{R}^{n \times n}\) respectively.

a) Compute a full SVD of \(A_1 \otimes A_2\). [Hint: Use Problem 8.13.]

b) Show that the rank of \(A_1 \otimes A_2\) is \(\mathrm{rk}(A_1) \,\mathrm{rk}(A_2)\). \(\lhd\)
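
The rank claim in Problem 8.14(b) can be illustrated numerically before proving it in general; the sketch below (NumPy assumed, with matrices deliberately made rank-deficient) is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((4, 4)); A1[:, 3] = A1[:, 0]  # repeated column: rank 3
A2 = rng.standard_normal((4, 4)); A2[:, 2:] = 0        # two zero columns: rank 2
print(np.linalg.matrix_rank(np.kron(A1, A2)),
      np.linalg.matrix_rank(A1) * np.linalg.matrix_rank(A2))
```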

8.15 Let \(P_1\) and \(P_2\) be transition matrices.

a) Let \(\bpi_1\) and \(\bpi_2\) (as row vectors) be stationary distributions of \(P_1\) and \(P_2\) respectively. Show that \(\bpi_1 \otimes \bpi_2\) is a stationary distribution of \(P_1 \otimes P_2\).

b) Assume that \(P_1\) and \(P_2\) are both irreducible and weakly lazy. Show that the same holds for \(P_1 \otimes P_2\). \(\lhd\)
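
As a numerical illustration of Problem 8.15(a), the sketch below (NumPy assumed; the two small transition matrices are arbitrary examples, and the problem asks for a general proof) extracts the stationary distributions as left eigenvectors and checks stationarity of their Kronecker product.

```python
import numpy as np

P1 = np.array([[0.9, 0.1], [0.4, 0.6]])
P2 = np.array([[0.5, 0.5], [0.2, 0.8]])

def stationary(P):
    # Left eigenvector of P for eigenvalue 1, normalized to sum to 1.
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return v / v.sum()

pi1, pi2 = stationary(P1), stationary(P2)
pi12 = np.kron(pi1, pi2)
print(np.allclose(pi12 @ np.kron(P1, P2), pi12))  # stationarity check
```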

8.16 Let \(\mathbf{u}\) be a column vector and \(A, B\) be matrices of such size that one can form the matrix product \(AB\).

a) Let \(\mathbf{a}_1^T, \ldots, \mathbf{a}_n^T\) be the rows of \(A\). Prove that

\[\begin{split} A \otimes \mathbf{u} = \begin{pmatrix} \mathbf{u} \mathbf{a}_1^T\\ \vdots\\ \mathbf{u} \mathbf{a}_n^T \end{pmatrix}. \end{split}\]

b) Prove part (f) of the Kronecker Product Properties Lemma. \(\lhd\)

8.17 Prove the Composition of Continuous Functions Lemma. \(\lhd\)

8.18 Consider the map \(X \mathbf{z}\) as a function of the entries of the matrix \(X \in \mathbb{R}^{n \times m}\). Specifically, for a fixed \(\mathbf{z} \in \mathbb{R}^m\), letting \((\mathbf{x}^{(i)})^T\) be the \(i\)-th row of \(X\) we define the function

\[\begin{split} \mathbf{f}(\mathbf{x}) = X \mathbf{z} = \begin{pmatrix} (\mathbf{x}^{(1)})^T \mathbf{z} \\ \vdots\\ (\mathbf{x}^{(n)})^T \mathbf{z} \end{pmatrix} \end{split}\]

where \(\mathbf{x} = \mathrm{vec}(X^T) = (\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})\). Show that \(\mathbf{f}\) is linear in \(\mathbf{x}\), that is, \(\mathbf{f}(\alpha \mathbf{x} + \beta \mathbf{x}') = \alpha\, \mathbf{f}(\mathbf{x}) + \beta\, \mathbf{f}(\mathbf{x}')\) for all \(\mathbf{x}, \mathbf{x}' \in \mathbb{R}^{nm}\) and all \(\alpha, \beta \in \mathbb{R}\).

\(\lhd\)

8.19 Let \(f(x_1, x_2) = \sin(x_1^2 + x_2) + \cos(x_1 x_2)\). Compute the gradient of \(f\) using the Chain Rule by defining appropriate functions \(\mathbf{g}\) and \(h\) such that \(f = h \circ \mathbf{g}\).

\(\lhd\)

8.20 Consider the function \(f(x_1, x_2, x_3) = \sqrt{x_1 + x_2^2 + \exp(x_3)}\). Find the gradient of \(f\) at the point \((1, 2, 0)\) using the Chain Rule by defining suitable functions \(\mathbf{g}\) and \(h\) such that \(f = h \circ \mathbf{g}\).

\(\lhd\)
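
Once you have a hand-derived gradient for Problem 8.20, it can be compared against automatic differentiation (a PyTorch sketch; the problem itself asks for a Chain Rule derivation).

```python
import torch

x = torch.tensor([1.0, 2.0, 0.0], requires_grad=True)
f = torch.sqrt(x[0] + x[1]**2 + torch.exp(x[2]))
f.backward()
print(x.grad)  # numerical value of the gradient at (1, 2, 0)
```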

8.21 Consider the function \(f(x_1, x_2, x_3) = (x_1 + x_2^2)^3 + \sin(x_2 x_3)\). Find the gradient of \(f\) using the Chain Rule by defining suitable functions \(\mathbf{g}\) and \(h\) such that \(f = h \circ \mathbf{g}\).

\(\lhd\)

8.22 For \(i=1, \ldots, n\), let \(f_i : D_i \to \mathbb{R}\), with \(D_i \subseteq \mathbb{R}\), be a continuously differentiable real-valued function of a single variable. Consider the vector-valued function \(\mathbf{f} : D_1 \times \cdots \times D_n \to \mathbb{R}^n\) defined as

\[ \mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_n(\mathbf{x})) = (f_1(x_1), \ldots, f_n(x_n)). \]

For \(\mathbf{x} = (x_1, \ldots, x_n)\) such that \(x_i\) is an interior point of \(D_i\) for all \(i\), compute the Jacobian \(J_{\mathbf{f}}(\mathbf{x})\).

\(\lhd\)

8.23 Let \(f\) be a real-valued function taking a matrix \(A = (a_{i,j})_{i,j} \in \mathbb{R}^{n \times n}\) as an input. Assume \(f\) is continuously differentiable in each entry of \(A\). Consider the following matrix derivative

\[\begin{split} \frac{\partial f(A)}{\partial A} = \begin{pmatrix} \frac{\partial f(A)}{\partial a_{1,1}} & \cdots & \frac{\partial f(A)}{\partial a_{1,n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial f(A)}{\partial a_{n,1}} & \cdots & \frac{\partial f(A)}{\partial a_{n,n}} \end{pmatrix}. \end{split}\]

a) Show that, for any \(B \in \mathbb{R}^{n \times n}\),

\[ \frac{\partial \,\mathrm{tr}(B^T A)}{\partial A} = B. \]

b) Show that, for any \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^d\),

\[ \frac{\partial \,\mathbf{x}^T A \mathbf{y}}{\partial A} = \mathbf{x} \mathbf{y}^T. \]

\(\lhd\)

8.24 Let \(A = (a_{i,j})_{i \in [n], j \in [m]} \in \mathbb{R}^{n \times m}\) and \(B = (b_{i,j})_{i \in [p], j \in [q]} \in \mathbb{R}^{p \times q}\) be arbitrary matrices. Their Kronecker product, denoted \(A \otimes B \in \mathbb{R}^{np \times mq}\), is the following matrix in block form

\[\begin{split} A \otimes B = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,m} B \\ \vdots & \ddots & \vdots \\ a_{n,1} B & \cdots & a_{n,m} B \end{pmatrix}. \end{split}\]

The Kronecker product satisfies the following properties (which follow from block formulas, but which you do not have to prove): 1) if \(A, B, C, D\) are matrices of such size that one can form the matrix products \(AC\) and \(BD\), then \((A \otimes B)\,(C \otimes D) = (AC) \otimes (BD)\); 2) the transpose of \(A \otimes B\) is \((A \otimes B)^T = A^T \otimes B^T\).

a) Show that if \(D_1\) and \(D_2\) are square diagonal matrices, then so is \(D_1 \otimes D_2\).

b) Show that if \(Q_1\) and \(Q_2\) have orthonormal columns, so does \(Q_1 \otimes Q_2\).

c) Let \(A_1 = U_1 \Sigma_1 V_1^T\) and \(A_2 = U_2 \Sigma_2 V_2^T\) be full SVDs of \(A_1, A_2 \in \mathbb{R}^{n \times n}\) respectively. Compute a full SVD of \(A_1 \otimes A_2\).

d) Let \(A_1\) and \(A_2\) be as in (c). Show that the rank of \(A_1 \otimes A_2\) is \(\mathrm{rk}(A_1) \,\mathrm{rk}(A_2)\).

\(\lhd\)