\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

8.7. Online supplementary material#

8.7.1. Quizzes, solutions, code, etc.#

8.7.1.1. Just the code#

An interactive Jupyter notebook featuring the code in this chapter can be accessed below (Google Colab recommended). You are encouraged to tinker with it. Some suggested computational exercises are scattered throughout. The notebook is also available as a slideshow.

8.7.1.2. Self-assessment quizzes#

A more extensive web version of the self-assessment quizzes is available by following the links below.

8.7.1.3. Auto-quizzes#

Automatically generated quizzes for this chapter can be accessed here (Google Colab recommended).

8.7.1.4. Solutions to odd-numbered warm-up exercises#

(with help from Claude, Gemini, and ChatGPT)

E8.2.1 The vectorization is obtained by stacking the columns of \(A\): \(\text{vec}(A) = (2, 0, 1, -1)\).

E8.2.3

\[\begin{split} A \otimes B = \begin{pmatrix} 1 \cdot B & 2 \cdot B \\ -1 \cdot B & 0 \cdot B \end{pmatrix} = \begin{pmatrix} 3 & -1 & 6 & -2 \\ 2 & 1 & 4 & 2 \\ -3 & 1 & 0 & 0 \\ -2 & -1 & 0 & 0 \end{pmatrix}. \end{split}\]

E8.2.5

\[\begin{split} A \otimes B = \begin{pmatrix} 1 \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} & 2 \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} \\ 3 \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} & 4 \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} \end{pmatrix} = \begin{pmatrix} 5 & 6 & 10 & 12 \\ 7 & 8 & 14 & 16 \\ 15 & 18 & 20 & 24 \\ 21 & 24 & 28 & 32 \end{pmatrix}. \end{split}\]
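These computations are easy to check numerically. Here is a small NumPy sketch (not from the text); the first matrix is an assumption, reconstructed as the 2x2 matrix whose vectorization is given in E8.2.1.

import numpy as np
A1 = np.array([[2, 1], [0, -1]])   # assumed 2x2 matrix behind E8.2.1, with columns (2, 0) and (1, -1)
print(A1.flatten(order='F'))       # column-major vectorization: [ 2  0  1 -1]
A = np.array([[1, 2], [3, 4]])     # matrices from E8.2.5
B = np.array([[5, 6], [7, 8]])
print(np.kron(A, B))               # matches the 4x4 matrix above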

E8.2.7 First, compute the Jacobian matrices of \(\mathbf{f}\) and \(\mathbf{g}\):

\[\begin{split} J_{\mathbf{f}}(x, y) = \begin{pmatrix} 2x & 2y \\ y & x \end{pmatrix}, \quad J_{\mathbf{g}}(u, v) = \begin{pmatrix} v & u \\ 1 & 1 \end{pmatrix}. \end{split}\]

Then, by the Chain Rule,

\[\begin{split} J_{\mathbf{g} \circ \mathbf{f}}(1, 2) = J_{\mathbf{g}}(\mathbf{f}(1, 2)) \, J_{\mathbf{f}}(1, 2) = J_{\mathbf{g}}(5, 2) \, J_{\mathbf{f}}(1, 2) = \begin{pmatrix} 2 & 5 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 2 & 4 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} 14 & 13 \\ 4 & 5 \end{pmatrix}. \end{split}\]
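As a sanity check, the same Jacobian can be recovered with automatic differentiation. The sketch below assumes \(\mathbf{f}(x, y) = (x^2 + y^2, xy)\) and \(\mathbf{g}(u, v) = (uv, u + v)\), forms consistent with the Jacobian matrices above.

import torch
from torch.autograd.functional import jacobian

def f(v):
    x, y = v
    return torch.stack([x**2 + y**2, x * y])

def g(v):
    u, w = v
    return torch.stack([u * w, u + w])

x0 = torch.tensor([1.0, 2.0])
print(jacobian(lambda v: g(f(v)), x0))   # [[14., 13.], [4., 5.]]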

E8.2.9 From E8.2.5, we have

\[\begin{split} A \otimes B = \begin{pmatrix} 5 & 6 & 10 & 12 \\ 7 & 8 & 14 & 16 \\ 15 & 18 & 20 & 24 \\ 21 & 24 & 28 & 32 \end{pmatrix} \end{split}\]

So,

\[\begin{split}(A \otimes B)^T = \begin{pmatrix} 5 & 7 & 15 & 21 \\ 6 & 8 & 18 & 24 \\ 10 & 14 & 20 & 28 \\ 12 & 16 & 24 & 32 \end{pmatrix}. \end{split}\]

Now,

\[\begin{split} A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}, \quad B^T = \begin{pmatrix} 5 & 7 \\ 6 & 8 \end{pmatrix} \end{split}\]

So,

\[\begin{split} A^T \otimes B^T = \begin{pmatrix} 1 \begin{pmatrix} 5 & 7 \\ 6 & 8 \end{pmatrix} & 3 \begin{pmatrix} 5 & 7 \\ 6 & 8 \end{pmatrix} \\ 2 \begin{pmatrix} 5 & 7 \\ 6 & 8 \end{pmatrix} & 4 \begin{pmatrix} 5 & 7 \\ 6 & 8 \end{pmatrix} \end{pmatrix} = \begin{pmatrix} 5 & 7 & 15 & 21 \\ 6 & 8 & 18 & 24 \\ 10 & 14 & 20 & 28 \\ 12 & 16 & 24 & 32 \end{pmatrix}. \end{split}\]

We see that \((A \otimes B)^T = A^T \otimes B^T\), as expected from the properties of the Kronecker product.
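A one-line NumPy check of the identity \((A \otimes B)^T = A^T \otimes B^T\) on these matrices:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))   # True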

E8.2.11

\[\begin{split} \nabla f(x, y, z) = \begin{pmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \\ \frac{\partial f}{\partial z} \end{pmatrix} = \begin{pmatrix} 2x \\ 2y \\ 2z \end{pmatrix}. \end{split}\]

So,

\[\begin{split} \nabla f(1, 2, 3) = \begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix}. \end{split}\]

E8.2.13 First, compute the gradient of \(f\) and the Jacobian matrix of \(\mathbf{g}\):

\[\begin{split} \nabla f(x, y) = \begin{pmatrix} y \\ x \end{pmatrix}, \quad J_{\mathbf{g}}(x, y) = \begin{pmatrix} 2x & 0 \\ 0 & 2y \end{pmatrix}. \end{split}\]

Then, by the Chain Rule,

\[\begin{split} J_{f \circ \mathbf{g}}(1, 2) = \nabla f(\mathbf{g}(1, 2))^T \, J_{\mathbf{g}}(1, 2) = \nabla f(1, 4)^T \, J_{\mathbf{g}}(1, 2) = \begin{pmatrix} 4 & 1 \end{pmatrix} \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix} = \begin{pmatrix} 8 & 4 \end{pmatrix}. \end{split}\]

E8.2.15 The Jacobian matrix of \(\mathbf{g}\) is

\[\begin{split} J_{\mathbf{g}}(x, y, z) = \begin{pmatrix} f'(x) & 0 & 0 \\ 0 & f'(y) & 0 \\ 0 & 0 & f'(z) \end{pmatrix} \end{split}\]

where \(f'(x) = \cos(x).\) So,

\[\begin{split} J_{\mathbf{g}}(\frac{\pi}{2}, \frac{\pi}{4}, \frac{\pi}{6}) = \begin{pmatrix} \cos(\frac{\pi}{2}) & 0 & 0 \\ 0 & \cos(\frac{\pi}{4}) & 0 \\ 0 & 0 & \cos(\frac{\pi}{6}) \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & \frac{\sqrt{3}}{2} \end{pmatrix}. \end{split}\]

E8.3.1 Each entry of \(AB\) is the dot product of a row of \(A\) and a column of \(B\), which takes 2 multiplications and 1 addition. Since \(AB\) has 4 entries, the total number of operations is \(4 \times 3 = 12\).

E8.3.3 We have

\[ \ell(\hat{\mathbf{y}}) = \hat{y}_1^2 + \hat{y}_2^2 \]

so the partial derivatives are \(\frac{\partial \ell}{\partial \hat{y}_1} = 2 \hat{y}_1\) and \(\frac{\partial \ell}{\partial \hat{y}_2} = 2 \hat{y}_2\) and

\[ J_{\ell}(\hat{\mathbf{y}}) = 2 \hat{\mathbf{y}}^T. \]

E8.3.5 From E8.3.4, we have \(\mathbf{z}_1 = \mathbf{g}_0(\mathbf{z}_0) = (-1, -1)\). Then, \(\mathbf{z}_2 = \mathbf{g}_1(\mathbf{z}_1) = (1, -2)\) and \(f(\mathbf{x}) = \ell(\mathbf{z}_2) = 5\). By the Chain Rule,

\[\begin{split} \nabla f(\mathbf{x})^T = J_f(\mathbf{x}) = J_{\ell}(\mathbf{z}_2) \,J_{\mathbf{g}_1}(\mathbf{z}_1) \,J_{\mathbf{g}_0}(\mathbf{z}_0) = 2 \mathbf{z}_2^T \begin{pmatrix} -1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ -1 & 0 \end{pmatrix} = (-6, -4) \begin{pmatrix} 1 & 2 \\ -1 & 0 \end{pmatrix} = (-2, -12). \end{split}\]
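A quick NumPy check of the matrix product above, using only the quantities already computed in the solution:

import numpy as np
z2 = np.array([1.0, -2.0])
J_g1 = np.array([[-1.0, 0.0], [1.0, 1.0]])
J_g0 = np.array([[1.0, 2.0], [-1.0, 0.0]])
print(2 * z2 @ J_g1 @ J_g0)   # [ -2. -12.]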

E8.3.7 We have

\[ g_1(\mathbf{z}_1, \mathbf{w}_1) = w_4 z_{1,1} + w_5 z_{1,2} \]

so, by computing all partial derivatives,

\[ J_{g_1}(\mathbf{z}_1, \mathbf{w}_1) = \begin{pmatrix} w_4 & w_5 & z_{1,1} & z_{1,2} \end{pmatrix} = \begin{pmatrix} \mathbf{w}_1^T & \mathbf{z}_1^T \end{pmatrix} = \begin{pmatrix} W_1 & I_{1 \times 1} \otimes \mathbf{z}_1^T \end{pmatrix}. \]

Using the notation in the text, \(A_1 = W_1\) and \(B_1 = \mathbf{z}_1^T = I_{1 \times 1} \otimes \mathbf{z}_1^T\).

E8.3.9 We have

\[ f(\mathbf{w}) = (w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3))^2 \]

so, by the Chain Rule, the partial derivatives are

\[ \frac{\partial f}{\partial w_0} = 2(w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (-w_4) \]
\[ \frac{\partial f}{\partial w_1} = 2(w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (w_4) \]
\[ \frac{\partial f}{\partial w_2} = 2(w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (-w_5) \]
\[ \frac{\partial f}{\partial w_3} = 2(w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (w_5) \]
\[ \frac{\partial f}{\partial w_4} = 2(w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (-w_0 + w_1) \]
\[ \frac{\partial f}{\partial w_5} = 2(w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (-w_2 + w_3). \]

Moreover, by E8.3.7,

\[\begin{split} z_2 = g_1(\mathbf{z}_1, \mathbf{w}_1) = W_1 \mathbf{z}_1 = \begin{pmatrix} w_4 & w_5\end{pmatrix} \begin{pmatrix}- w_0 + w_1\\-w_2 + w_3\end{pmatrix} = w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3). \end{split}\]

By the fundamental recursion and the results in E8.3.3 and E8.3.8,

\[\begin{align*} J_f(\mathbf{w}) &= J_{\ell}(h(\mathbf{w})) \,J_{h}(\mathbf{w}) = 2 z_2 \begin{pmatrix} A_1 B_0 & B_1\end{pmatrix}\\ &= 2 (w_4 (- w_0 + w_1) + w_5 (-w_2 + w_3)) (-w_4, w_4, -w_5, w_5, -w_0 + w_1, -w_2 + w_3). \end{align*}\]
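The closed-form gradient can be compared against automatic differentiation at a random parameter vector; a minimal sketch (not from the text):

import torch
w = torch.randn(6, requires_grad=True)
f = (w[4] * (-w[0] + w[1]) + w[5] * (-w[2] + w[3]))**2
f.backward()
z2 = (w[4] * (-w[0] + w[1]) + w[5] * (-w[2] + w[3])).detach()
manual = (2 * z2 * torch.stack([-w[4], w[4], -w[5], w[5],
                                -w[0] + w[1], -w[2] + w[3]])).detach()
print(torch.allclose(w.grad, manual))   # True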

E8.4.1 The full gradient descent step is:

\[ \frac{1}{5} \sum_{i=1}^5 \nabla f_{x_i, y_i}(w) = \frac{1}{5}((1, 2) + (-1, 1) + (0, -1) + (2, 0) + (1, 1)) = (\frac{3}{5}, \frac{3}{5}). \]

The expected SGD step with a batch size of 2 is:

\[ \mathbb{E} [\frac{1}{2} \sum_{i\in B} \nabla f_{x_i, y_i}(w)] = \frac{1}{5} \sum_{i=1}^5 \nabla f_{x_i, y_i}(w) = (\frac{3}{5}, \frac{3}{5}), \]

which is equal to the full gradient descent step, as proven in the “Expected SGD Step” lemma.
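The lemma can also be verified by brute force: averaging the mini-batch steps over all \(\binom{5}{2} = 10\) batches of size 2 recovers the full gradient. A small check:

import numpy as np
from itertools import combinations

grads = np.array([[1, 2], [-1, 1], [0, -1], [2, 0], [1, 1]], dtype=float)
print(grads.mean(axis=0))                                             # full step: [0.6 0.6]
batch_means = [grads[list(B)].mean(axis=0) for B in combinations(range(5), 2)]
print(np.mean(batch_means, axis=0))                                   # expected SGD step: [0.6 0.6]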

E8.4.3

\[\begin{align*} \mathrm{KL}(p \| q) &= \sum_{i=1}^3 p_i \log \frac{p_i}{q_i} \\ &= 0.2 \log \frac{0.2}{0.1} + 0.3 \log \frac{0.3}{0.4} + 0.5 \log \frac{0.5}{0.5} \\ &\approx 0.2 \cdot 0.6931 + 0.3 \cdot (-0.2877) + 0.5 \cdot 0 \\ &\approx 0.0523. \end{align*}\]
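In code, a direct NumPy translation of the sum above:

import numpy as np
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.1, 0.4, 0.5])
print(np.sum(p * np.log(p / q)))   # ~0.0523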

E8.4.5 The SGD update is given by

\[\begin{align*} w &\leftarrow w - \frac{\alpha}{|B|} \sum_{i \in B} \frac{\partial \ell}{\partial w}(w, b; x_i, y_i), \\ b &\leftarrow b - \frac{\alpha}{|B|} \sum_{i \in B} \frac{\partial \ell}{\partial b}(w, b; x_i, y_i). \end{align*}\]

Plugging in the values, we get

\[\begin{align*} w &\leftarrow 1 - \frac{0.1}{2} (2 \cdot 2(2 \cdot 1 + 0 - 3) + 2 \cdot 1(1 \cdot 1 + 0 - 2)) = 1.3, \\ b &\leftarrow 0 - \frac{0.1}{2} (2(2 \cdot 1 + 0 - 3) + 2(1 \cdot 1 + 0 - 2)) = 0.2. \end{align*}\]
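A small NumPy check of this update, assuming the squared loss \(\ell(w, b; x, y) = (wx + b - y)^2\) and the batch \(\{(2, 3), (1, 2)\}\) read off the arithmetic above:

import numpy as np
w, b, lr = 1.0, 0.0, 0.1
batch = [(2.0, 3.0), (1.0, 2.0)]   # assumed (x_i, y_i) pairs
grad_w = np.mean([2 * x * (w * x + b - y) for x, y in batch])
grad_b = np.mean([2 * (w * x + b - y) for x, y in batch])
print(w - lr * grad_w, b - lr * grad_b)   # 1.3 0.2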

E8.4.7

\[\begin{align*} \nabla \ell(w; x, y) &= -\frac{y}{\sigma(wx)} \sigma'(wx) x + \frac{1-y}{1-\sigma(wx)} \sigma'(wx) x \\ &= -yx(1 - \sigma(wx)) + x(1-y)\sigma(wx) \\ &= x(\sigma(wx) - y). \end{align*}\]

We used the fact that \(\sigma'(z) = \sigma(z)(1-\sigma(z))\).

E8.4.9 First, we compute \(z_1 = Wx = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} (1, 2) = (0, 0, 0)\). Then, \(\hat{y} = \gamma(z_1) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\). From the text, we have:

\[\begin{split} \nabla f(w) = (\gamma(Wx) - y) \otimes x = (\hat{y} - y) \otimes x = (\frac{1}{3}, \frac{1}{3}, -\frac{2}{3}) \otimes (1, 2) = \begin{pmatrix} \frac{1}{3} & \frac{2}{3} \\ \frac{1}{3} & \frac{2}{3} \\ -\frac{2}{3} & -\frac{4}{3} \end{pmatrix}. \end{split}\]
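The same gradient can be obtained with PyTorch's automatic differentiation, assuming the label is the one-hot vector \(\mathbf{y} = (0, 0, 1)\) (consistent with \(\hat{\mathbf{y}} - \mathbf{y}\) above) and the loss is the cross-entropy:

import torch
W = torch.zeros(3, 2, requires_grad=True)
x = torch.tensor([1.0, 2.0])
y = torch.tensor([0.0, 0.0, 1.0])   # assumed one-hot label
loss = -torch.sum(y * torch.log_softmax(W @ x, dim=0))
loss.backward()
print(W.grad)   # matches the 3x2 gradient above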

E8.4.11 First, we compute the individual gradients:

\[\begin{align*} \nabla f_{x_1, y_1}(W) &= (\gamma(Wx_1) - y_1) \otimes x_1 = (\frac{1}{3}, \frac{1}{3}, -\frac{2}{3}) \otimes (1, 2) = \begin{pmatrix} \frac{1}{3} & \frac{2}{3} \\ \frac{1}{3} & \frac{2}{3} \\ -\frac{2}{3} & -\frac{4}{3} \end{pmatrix}, \\ \nabla f_{x_2, y_2}(W) &= (\gamma(Wx_2) - y_2) \otimes x_2 = (-\frac{2}{3}, \frac{1}{3}, \frac{1}{3}) \otimes (4, -1) = \begin{pmatrix} -\frac{8}{3} & \frac{2}{3} \\ \frac{4}{3} & -\frac{1}{3} \\ \frac{4}{3} & -\frac{1}{3} \end{pmatrix}. \end{align*}\]

Then, the full gradient is:

\[\begin{split} \frac{1}{2} (\nabla f_{x_1, y_1}(W) + \nabla f_{x_2, y_2}(W)) = \frac{1}{2} \left(\begin{pmatrix} \frac{1}{3} & \frac{2}{3} \\ \frac{1}{3} & \frac{2}{3} \\ -\frac{2}{3} & -\frac{4}{3} \end{pmatrix} + \begin{pmatrix} -\frac{8}{3} & \frac{2}{3} \\ \frac{4}{3} & -\frac{1}{3} \\ \frac{4}{3} & -\frac{1}{3} \end{pmatrix}\right) = \begin{pmatrix} -\frac{7}{6} & \frac{2}{3} \\ \frac{5}{6} & \frac{1}{6} \\ \frac{1}{3} & -\frac{5}{6} \end{pmatrix}. \end{split}\]
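Averaging the two matrices numerically confirms the entries (in particular, the \((1,1)\) entry \(-7/6 \approx -1.167\)):

import numpy as np
g1 = np.array([[1, 2], [1, 2], [-2, -4]]) / 3
g2 = np.array([[-8, 2], [4, -1], [4, -1]]) / 3
print((g1 + g2) / 2)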

E8.4.13 The cross-entropy loss is given by

\[ -\log(0.3) \approx 1.204. \]

E8.5.1 \(\sigma(1) = \frac{1}{1 + e^{-1}} \approx 0.73\), \(\sigma(-1) = \frac{1}{1 + e^{1}} \approx 0.27\), and \(\sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.88\).

E8.5.3 \(\bsigma(\mathbf{z}) = (\bsigma(1), \bsigma(-1), \bsigma(2)) \approx (0.73, 0.27, 0.88)\) and \(\bsigma'(\mathbf{z}) = (\bsigma'(1), \bsigma'(-1), \bsigma'(2)) \approx (0.20, 0.20, 0.10)\).
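These values are easy to reproduce with PyTorch:

import torch
z = torch.tensor([1.0, -1.0, 2.0])
s = torch.sigmoid(z)
print(s)             # ~[0.73, 0.27, 0.88]
print(s * (1 - s))   # ~[0.20, 0.20, 0.10]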

E8.5.5 \(W\mathbf{x} = \begin{pmatrix} -1 \\ 4 \end{pmatrix}\), so \(\sigma(W\mathbf{x}) \approx (0.27, 0.98)\), \(J_\bsigma(W\mathbf{x}) = \mathrm{diag}(\bsigma(W\mathbf{x}) \odot (1 - \bsigma(W\mathbf{x}))) \approx \begin{pmatrix} 0.20 & 0 \\ 0 & 0.02 \end{pmatrix}\)

E8.5.7 \(\nabla H(\mathbf{y}, \mathbf{z}) = (-\frac{y_1}{z_1}, -\frac{y_2}{z_2}) = (-\frac{0}{0.3}, -\frac{1}{0.7}) \approx (0, -1.43)\)

8.7.1.5. Learning outcomes#

  • Define the Jacobian matrix and use it to compute the differential of a vector-valued function.

  • State and apply the generalized Chain Rule to compute the Jacobian of a composition of functions.

  • Perform calculations with the Hadamard and Kronecker products of matrices.

  • Describe the purpose of automatic differentiation and its advantages over symbolic and numerical differentiation.

  • Implement automatic differentiation in PyTorch to compute gradients of vector-valued functions.

  • Derive the Chain Rule for multi-layer progressive functions and apply it to compute gradients.

  • Compare and contrast the forward and reverse modes of automatic differentiation in terms of computational complexity.

  • Define progressive functions and identify their key properties.

  • Derive the fundamental recursion for the Jacobian of a progressive function.

  • Implement the backpropagation algorithm to efficiently compute gradients of progressive functions.

  • Analyze the computational complexity of the backpropagation algorithm in terms of the number of matrix-vector products.

  • Describe the stochastic gradient descent (SGD) algorithm and explain how it differs from standard gradient descent.

  • Derive the update rule for stochastic gradient descent from the gradient of the loss function.

  • Prove that the expected SGD step is equal to the full gradient descent step.

  • Evaluate the performance of models trained using stochastic gradient descent on real-world datasets.

  • Define the multilayer perceptron (MLP) architecture and describe the role of affine maps and activation functions in each layer.

  • Compute the Jacobian of the sigmoid activation function using properties of diagonal matrices and Kronecker products.

  • Apply the Chain Rule to calculate the gradient of the loss function with respect to the weights in a small MLP example.

  • Generalize the gradient computation for an MLP with an arbitrary number of layers using a forward and backward pass.

  • Implement the training of a neural network using PyTorch.

\(\aleph\)

8.7.2. Additional sections#

8.7.2.1. Another example: linear regression#

We give another concrete example of progressive functions and of the application of backpropagation and stochastic gradient descent.

Computing the gradient While we have motivated the framework introduced in the previous section from the point of view of classification, it also immediately applies to the regression setting. Both classification and regression are instances of supervised learning.

We first compute the gradient of a single sample. Here \(\mathbf{x} \in \mathbb{R}^d\) again, but \(y\) is a real-valued outcome variable. We revisit the case of linear regression where the loss function is

\[ \ell(z) = (z - y)^2 \]

and the regression function only has input and output layers and no hidden layer (that is, \(L=0\)) with

\[ h(\mathbf{w}) = \sum_{j=1}^d w_{j} x_{j} = \mathbf{x}^T\mathbf{w}, \]

where \(\mathbf{w} \in \mathbb{R}^{d}\) are the parameters. Recall that we can include a constant term (one that does not depend on the input) by adding a \(1\) to the input. To keep the notation simple, we assume that this pre-processing has already been performed if desired.

Finally, the objective function for a single sample is

\[ f(\mathbf{w}) = \ell(h(\mathbf{w})) = \left(\sum_{j=1}^d w_{j} x_{j} - y\right)^2 = \left(\mathbf{x}^T\mathbf{w} - y\right)^2. \]

Using the notation from the previous subsection, the forward pass in this case is:

Initialization:

\[\mathbf{z}_0 := \mathbf{x}.\]

Forward layer loop:

\[\begin{align*} \hat{y} := z_1 := g_0(\mathbf{z}_0,\mathbf{w}_0) &= \sum_{j=1}^d w_{0,j} z_{0,j} = \mathbf{z}_0^T \mathbf{w}_0 \end{align*}\]
\[\begin{align*} \begin{pmatrix} A_0 & B_0 \end{pmatrix} := J_{g_0}(\mathbf{z}_0,\mathbf{w}_0) &= ( w_{0,1},\ldots, w_{0,d},z_{0,1},\ldots,z_{0,d} )^T = \begin{pmatrix}\mathbf{w}_0^T & \mathbf{z}_0^T\end{pmatrix}, \end{align*}\]

where \(\mathbf{w}_0 := \mathbf{w}\).

Loss:

\[\begin{align*} z_2 &:= \ell(z_1) = (z_1 - y)^2\\ p_1 &:= \frac{\mathrm{d}}{\mathrm{d} z_1} {\ell}(z_1) = 2 (z_1 - y). \end{align*}\]

The backward pass is:

Backward layer loop:

\[\begin{align*} \mathbf{q}_0 := B_0^T p_1 &= 2 (z_1 - y) \, \mathbf{z}_0. \end{align*}\]

Output:

\[ \nabla f(\mathbf{w}) = \mathbf{q}_0. \]

As we noted before, there is in fact no need to compute \(A_0\) and \(\mathbf{p}_0\).
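For concreteness, here is a minimal NumPy sketch (not from the text) of this single-sample forward and backward pass; the numbers are made up for illustration.

import numpy as np

def single_sample_grad(w, x, y):
    # forward pass: z1 = z0^T w0 with z0 = x, then the loss derivative p1
    z1 = x @ w
    p1 = 2 * (z1 - y)
    # backward pass: q0 = B0^T p1 with B0 = z0^T
    return p1 * x

w = np.array([1.0, -2.0, 0.5])
x = np.array([3.0, 1.0, 2.0])
y = 4.0
print(single_sample_grad(w, x, y))
print(2 * (x @ w - y) * x)   # same gradient, written directly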

The Advertising dataset and the least-squares solution We return to the Advertising dataset.

data = pd.read_csv('advertising.csv')
data.head()
Unnamed: 0 TV radio newspaper sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
n = len(data.index)
print(n)
200

We first compute the solution using the least-squares approach we detailed previously. We use numpy.column_stack to add a column of ones to the feature vectors.

TV = data['TV'].to_numpy()
radio = data['radio'].to_numpy()
newspaper = data['newspaper'].to_numpy()
sales = data['sales'].to_numpy()
features = np.stack((TV, radio, newspaper), axis=-1)
A = np.column_stack((np.ones(n), features))
coeff = mmids.ls_by_qr(A, sales)
print(coeff)
[ 2.93888937e+00  4.57646455e-02  1.88530017e-01 -1.03749304e-03]

The MSE is:

np.mean((A @ coeff - sales)**2)
2.7841263145109365

Solving the problem using PyTorch We will be using PyTorch to implement the previous method. We first convert the data into PyTorch tensors. We then use torch.utils.data.TensorDataset to create the dataset. Finally, torch.utils.data.DataLoader provides the utilities to load the data in batches for training. We take mini-batches of size BATCH_SIZE = 64 and we apply a random permutation of the samples on every pass (with the option shuffle=True).

import torch  # needed below for torch.tensor
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim

features_tensor = torch.tensor(features, dtype=torch.float32)
sales_tensor = torch.tensor(sales, dtype=torch.float32).view(-1, 1)

BATCH_SIZE = 64
train_dataset = TensorDataset(features_tensor, sales_tensor)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

Now we construct our model. It is simply an affine map from \(\mathbb{R}^3\) to \(\mathbb{R}\). Note that there is no need to pre-process the inputs by adding \(1\)s. A constant term (or “bias variable”) is automatically added by PyTorch (unless one chooses the option bias=False).

model = nn.Sequential(
    nn.Linear(3, 1)  # 3 input features, 1 output value
)

Finally, we are ready to run an optimization method of our choice on a loss function; both are specified next. There are many optimizers available. (See this post for a brief explanation of many common optimizers.) Here we use SGD as the optimizer, and the loss function is the MSE. A quick tutorial is here.

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-5)

Choosing the right number of passes (i.e. epochs) through the data requires some experimenting. Here \(10^4\) suffices. But in the interest of time, we will run it only for \(100\) epochs. As you will see from the results, this is not quite enough. On each pass, we compute the output of the current model, use backward() to obtain the gradient, and then perform a descent update with step(). We also have to reset the gradients first (otherwise they add up by default).

epochs = 100
for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

The final parameters and loss are:

weights = model[0].weight.detach().numpy()
bias = model[0].bias.detach().numpy()
print("Weights:", weights)
print("Bias:", bias)
Weights: [[0.0591674  0.23021418 0.01832497]]
Bias: [-0.57107234]
# Evaluate the model
model.eval()
with torch.no_grad():
    total_loss = 0
    for inputs, targets in train_loader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        total_loss += loss.item()
        
    print(f"Mean Squared Error on Training Set: {total_loss / len(train_loader)}")
Mean Squared Error on Training Set: 4.5673341155052185

8.7.2.2. Convolutional neural networks#

One can do even better using a neural network tailored for images, known as convolutional neural networks. From Wikipedia:

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

More background can be found in this excellent module from Stanford’s CS231n. Our CNN will be a composition of convolutional layers and pooling layers.

CHAT & LEARN Convolutional neural networks (CNNs) are powerful for image classification. Ask your favorite AI chatbot to explain the basic concepts of CNNs, including convolutional layers and pooling layers. \(\ddagger\)

import torch  # needed below for torch.device
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

train_dataset = datasets.MNIST(root='./data', train=True, 
                               download=True, transform=transforms.ToTensor())
test_dataset = datasets.MNIST(root='./data', train=False, 
                              download=True, transform=transforms.ToTensor())

BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() 
                      else ("mps" if torch.backends.mps.is_available() 
                            else "cpu"))
print("Using device:", device)
Using device: mps

The new model is the following.

model = nn.Sequential(
    # First convolution, operating upon a 28x28 image
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    # Second convolution, operating upon a 14x14 image
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    # Third convolution, operating upon a 7x7 image
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    # Flatten the tensor
    nn.Flatten(),

    # Fully connected layer
    nn.Linear(32 * 3 * 3, 10),
).to(device)

We train and test.

loss_fn = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters())
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)
Epoch 1/3
Epoch 2/3
Epoch 3/3
mmids.test(test_loader, model, loss_fn, device)
Test error: 88.1% accuracy

Note the high accuracy.

Finally, we try the Fashion-MNIST dataset. We use the same CNN.

train_dataset = datasets.FashionMNIST(root='./data', train=True, 
                                      download=True, transform=transforms.ToTensor())
test_dataset = datasets.FashionMNIST(root='./data', train=False, 
                                     download=True, transform=transforms.ToTensor())

BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
loss_fn = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters())
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)
Epoch 1/3
Epoch 2/3
Epoch 3/3
mmids.test(test_loader, model, loss_fn, device)
Test error: 90.0% accuracy

The accuracy is not as high, as this is a more difficult dataset.

8.7.3. Additional proofs#

Proofs of Lagrange Multipliers Conditions We first prove the Lagrange Multipliers: First-Order Necessary Conditions. We follow the excellent textbook [Ber, Section 4.1]. The proof uses the concept of the Jacobian.

Proof idea: We reduce the problem to an unconstrained optimization problem by penalizing the constraint in the objective function. We then apply the unconstrained First-Order Necessary Conditions.

Proof: (Lagrange Multipliers: First-Order Necessary Conditions) We reduce the problem to an unconstrained optimization problem by penalizing the constraint in the objective function. We also add a regularization term to ensure that \(\mathbf{x}^*\) is the unique local minimizer in a neighborhood. Specifically, for each non-negative integer \(k\), consider the objective function

\[ F^k(\mathbf{x}) = f(\mathbf{x}) + \frac{k}{2} \|\mathbf{h}(\mathbf{x})\|^2 + \frac{\alpha}{2} \|\mathbf{x} - \mathbf{x}^*\|^2 \]

for some positive constant \(\alpha > 0\). Note that as \(k\) gets larger, the penalty becomes more significant and, therefore, enforcing the constraint becomes more desirable. The proof proceeds in several steps.

We first consider a version minimizing \(F^k\) constrained to lie in a neighborhood of \(\mathbf{x}^*\). Because \(\mathbf{x}^*\) is a local minimizer of \(f\) subject to \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\), there is \(\delta > 0\) such that \(f(\mathbf{x}^*)\leq f(\mathbf{x})\) for all feasible \(\mathbf{x}\) within

\[ \mathscr{X} = B_{\delta}(\mathbf{x}^*) = \{\mathbf{x}:\|\mathbf{x} - \mathbf{x}^*\| \leq \delta\}. \]

LEMMA (Step I: Solving the Penalized Problem in a Neighborhood of \(\mathbf{x}^*\)) For \(k \geq 1\), let \(\mathbf{x}^k\) be a global minimizer of the minimization problem

\[ \min_{\mathbf{x} \in \mathscr{X}} F^k(\mathbf{x}). \]

a) The sequence \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) converges to \(\mathbf{x}^*\).

b) For \(k\) sufficiently large, \(\mathbf{x}^k\) is a local minimizer of the objective function \(F^k\) without any constraint.

\(\flat\)

Proof: The set \(\mathscr{X}\) is closed and bounded, and \(F^k\) is continuous. Hence, the sequence \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) is well-defined by the Extreme Value Theorem. Let \(\bar{\mathbf{x}}\) be any limit point of \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\). We show that \(\bar{\mathbf{x}} = \mathbf{x}^*\). That will imply a). It also implies b) since then, for large enough \(k\), \(\mathbf{x}^k\) must be an interior point of \(\mathscr{X}\), and a minimizer over \(\mathscr{X}\) that lies in its interior is an unconstrained local minimizer.

Let \(-\infty < m \leq M < + \infty\) be the smallest and largest values of \(f\) on \(\mathscr{X}\), which exist by the Extreme Value Theorem. Then, for all \(k\), by definition of \(\mathbf{x}^k\) and the fact that \(\mathbf{x}^*\) is feasible

\[\begin{align*} (*) \qquad f(\mathbf{x}^k) &+ \frac{k}{2} \|\mathbf{h}(\mathbf{x}^k)\|^2 + \frac{\alpha}{2} \|\mathbf{x}^k - \mathbf{x}^*\|^2\\ &\leq f(\mathbf{x}^*) + \frac{k}{2} \|\mathbf{h}(\mathbf{x}^*)\|^2 + \frac{\alpha}{2} \|\mathbf{x}^* - \mathbf{x}^*\|^2 = f(\mathbf{x}^*). \end{align*}\]

Rearranging gives

\[ \|\mathbf{h}(\mathbf{x}^k)\|^2 \leq \frac{2}{k} \left[f(\mathbf{x}^*) - f(\mathbf{x}^k) - \frac{\alpha}{2} \|\mathbf{x}^k - \mathbf{x}^*\|^2\right] \leq \frac{2}{k} \left[ f(\mathbf{x}^*) - m\right]. \]

So \(\lim_{k \to \infty} \|\mathbf{h}(\mathbf{x}^k)\|^2 = 0\), which, by the continuity of \(\mathbf{h}\) and of the Euclidean norm, implies that \(\|\mathbf{h}(\bar{\mathbf{x}})\|^2 = 0\), that is, \(\mathbf{h}(\bar{\mathbf{x}}) = \mathbf{0}\). In other words, any limit point \(\bar{\mathbf{x}}\) is feasible.

In addition to being feasible, \(\bar{\mathbf{x}} \in \mathscr{X}\) because that constraint set is closed. So, by the choice of \(\mathscr{X}\), we have \(f(\mathbf{x}^*) \leq f(\bar{\mathbf{x}})\). Furthermore, by \((*)\), we get

\[ f(\mathbf{x}^*) \leq f(\bar{\mathbf{x}}) + \frac{\alpha}{2} \|\bar{\mathbf{x}} - \mathbf{x}^*\|^2 \leq f(\mathbf{x}^*). \]

This is only possible if \(\|\bar{\mathbf{x}} - \mathbf{x}^*\|^2 = 0\) or, put differently, \(\bar{\mathbf{x}} = \mathbf{x}^*\). That proves the lemma. \(\square\)

LEMMA (Step II: Applying the Unconstrained Necessary Conditions) Let \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) be the sequence in the previous lemma.

a) For sufficiently large \(k\), the vectors \(\nabla h_i(\mathbf{x}^k)\), \(i=1,\ldots,\ell\), are linearly independent.

b) Let \(\mathbf{J}_{\mathbf{h}}(\mathbf{x})\) be the Jacobian matrix of \(\mathbf{h}\), that is, the matrix whose rows are the (row) vectors \(\nabla h_i(\mathbf{x})^T\), \(i=1,\ldots,\ell\). Then

\[ \nabla f(\mathbf{x}^*) + \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \blambda^* = \mathbf{0} \]

where

\[ \blambda^* = - (\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T)^{-1} \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \nabla f(\mathbf{x}^*). \]

\(\flat\)

Proof: By the previous lemma, for \(k\) large enough, \(\mathbf{x}^k\) is an unconstrained local minimizer of \(F^k\). So by the (unconstrained) First-Order Necessary Conditions, it holds that

\[ \nabla F^k(\mathbf{x}^k) = \mathbf{0}. \]

To compute the gradient of \(F^k\) we note that

\[ \|\mathbf{h}(\mathbf{x})\|^2 = \sum_{i=1}^\ell (h_i(\mathbf{x}))^2. \]

The partial derivatives are

\[ \frac{\partial}{\partial x_j} \|\mathbf{h}(\mathbf{x})\|^2 = \sum_{i=1}^\ell \frac{\partial}{\partial x_j} (h_i(\mathbf{x}))^2 = \sum_{i=1}^\ell 2 h_i(\mathbf{x}) \frac{\partial h_i(\mathbf{x})}{\partial x_j}, \]

by the Chain Rule. So, in vector form,

\[ \nabla \|\mathbf{h}(\mathbf{x})\|^2 = 2 \mathbf{J}_{\mathbf{h}}(\mathbf{x})^T \mathbf{h}(\mathbf{x}). \]

The term \(\|\mathbf{x} - \mathbf{x}^*\|^2\) can be rewritten as the quadratic function

\[ \|\mathbf{x} - \mathbf{x}^*\|^2 = \frac{1}{2}\mathbf{x}^T (2 I_{d \times d}) \mathbf{x} - 2 (\mathbf{x}^*)^T \mathbf{x} + (\mathbf{x}^*)^T \mathbf{x}^*. \]

Using a previous formula with \(P = 2 I_{d \times d}\) (which is symmetric), \(\mathbf{q} = -2 \mathbf{x}^*\) and \(r = (\mathbf{x}^*)^T \mathbf{x}^*\), we get

\[ \nabla \|\mathbf{x} - \mathbf{x}^*\|^2 = 2\mathbf{x} -2 \mathbf{x}^*. \]

So, putting everything together,

\[ (**) \qquad \mathbf{0} = \nabla F^k(\mathbf{x}^k) = \nabla f(\mathbf{x}^k) + \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T (k \mathbf{h}(\mathbf{x}^k)) + \alpha(\mathbf{x}^k - \mathbf{x}^*). \]

By the previous lemma, \(\mathbf{x}^k \to \mathbf{x}^*\), \(\nabla f(\mathbf{x}^k) \to \nabla f(\mathbf{x}^*)\), and \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \to \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\) as \(k \to +\infty\).

So it remains to derive the limit of \(k \mathbf{h}(\mathbf{x}^k)\). By assumption, the columns of \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T\) are linearly independent. That implies that for any unit vector \(\mathbf{z} \in \mathbb{R}^\ell\)

\[ \mathbf{z}^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z} = \|\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z}\|^2 > 0 \]

otherwise we would have \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z} = \mathbf{0}\), contradicting the linear independence assumption. By the Extreme Value Theorem, there is \(\beta > 0\) such that

\[ \mathbf{z}^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z} \geq \beta \]

for all unit vectors \(\mathbf{z} \in \mathbb{R}^\ell\). Since \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \to \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\), it follows from a previous lemma that, for \(k\) large enough and any unit vector \(\mathbf{z} \in \mathbb{R}^\ell\),

\[ |\mathbf{z}^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z} - \mathbf{z}^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T \mathbf{z}| \leq \| \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T - \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T \|_F \leq \frac{\beta}{2}. \]

That implies

\[ \mathbf{z}^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T \mathbf{z} \geq \frac{\beta}{2}, \]

so by the same argument as above the columns of \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T\) are linearly independent for \(k\) large enough and \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T\) is invertible. That proves a).

Going back to \((**)\), multiplying both sides by \((\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)\, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T)^{-1} \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)\), and taking a limit \(k \to +\infty\), we get after rearranging that

\[ k \mathbf{h}(\mathbf{x}^k) \to - (\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T)^{-1} \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \nabla f(\mathbf{x}^*) = \blambda^*. \]

Plugging back we get

\[ \nabla f(\mathbf{x}^*) + \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \blambda^* = \mathbf{0} \]

as claimed. That proves b). \(\square\)

Combining the lemmas establishes the theorem. \(\square\)
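As a small numerical illustration (not from the text) of the multiplier formula in Step II, consider minimizing \(f(\mathbf{x}) = x_1 + x_2\) subject to \(h(\mathbf{x}) = x_1^2 + x_2^2 - 1 = 0\), whose minimizer is \(\mathbf{x}^* = (-1/\sqrt{2}, -1/\sqrt{2})\):

import numpy as np
xstar = -np.ones(2) / np.sqrt(2)
grad_f = np.ones(2)                        # gradient of f(x) = x1 + x2
J_h = 2 * xstar.reshape(1, 2)              # Jacobian of h (a single row)
lam = -np.linalg.inv(J_h @ J_h.T) @ J_h @ grad_f
print(grad_f + J_h.T @ lam)                # ~[0. 0.]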

Next, we prove the Lagrange Multipliers: Second-Order Sufficient Conditions. Again, we follow [Ber, Section 4.2]. We will need the following lemma. The proof can be skipped.

Proof idea: We consider a slightly modified problem

\[\begin{align*} &\text{min}\ f(\mathbf{x}) + \frac{c}{2} \|\mathbf{h}(\mathbf{x})\|^2\\ &\text{s.t.}\ \mathbf{h}(\mathbf{x}) = \mathbf{0} \end{align*}\]

which has the same local minimizers. Applying the Second-Order Sufficient Conditions to the Lagrangian of the modified problem gives the result when \(c\) is chosen large enough.

LEMMA Let \(P\) and \(Q\) be symmetric matrices in \(\mathbb{R}^{n \times n}\). Assume that \(Q\) is positive semidefinite and that \(P\) is positive definite on the null space of \(Q\), that is, \(\mathbf{w}^T P \mathbf{w} > 0\) for all \(\mathbf{w} \neq \mathbf{0}\) such that \(\mathbf{w}^T Q \mathbf{w} = \mathbf{0}\). Then there is a scalar \(\bar{c} \geq 0\) such that \(P + c Q\) is positive definite for all \(c > \bar{c}\). \(\flat\)

Proof: (lemma) We argue by contradiction. Suppose there is a non-negative, increasing, diverging sequence \(\{c_k\}_{k=1}^{+\infty}\) and a sequence of unit vectors \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) such that

\[ (\mathbf{x}^k)^T (P + c_k Q) \mathbf{x}^k \leq 0 \]

for all \(k\). Because the sequence is bounded, it has a limit point \(\bar{\mathbf{x}}\). Assume without loss of generality that \(\mathbf{x}^k \to \bar{\mathbf{x}}\) as \(k \to \infty\). Since \(c_k \to +\infty\) and \((\mathbf{x}^k)^T Q \mathbf{x}^k \geq 0\) by assumption, we must have \((\bar{\mathbf{x}})^T Q \bar{\mathbf{x}} = 0\), otherwise \((\mathbf{x}^k)^T (P + c_k Q) \mathbf{x}^k\) would diverge to \(+\infty\). Hence, by the assumption in the statement, it must be that \((\bar{\mathbf{x}})^T P \bar{\mathbf{x}} > 0\). On the other hand, dropping the non-negative term \(c_k (\mathbf{x}^k)^T Q \mathbf{x}^k\) in the display above gives \((\mathbf{x}^k)^T P \mathbf{x}^k \leq 0\) for all \(k\), so \((\bar{\mathbf{x}})^T P \bar{\mathbf{x}} \leq 0\) in the limit, a contradiction. \(\square\)

Proof: (Lagrange Multipliers: Second-Order Sufficient Conditions) We consider the modified problem

\[\begin{align*} &\text{min}\ f(\mathbf{x}) + \frac{c}{2} \|\mathbf{h}(\mathbf{x})\|^2\\ &\text{s.t.}\ \mathbf{h}(\mathbf{x}) = \mathbf{0}. \end{align*}\]

It has the same local minimizers as the original problem, since the additional term in the objective is zero for feasible vectors. That extra term will allow us to use the previous lemma. For notational convenience, we define

\[ g_c(\mathbf{x}) = f(\mathbf{x}) + \frac{c}{2} \|\mathbf{h}(\mathbf{x})\|^2. \]

The Lagrangian of the modified problem is

\[ L_c(\mathbf{x}, \blambda) = g_c(\mathbf{x}) + \mathbf{h}(\mathbf{x})^T \blambda. \]

We will apply the Second-Order Sufficient Conditions to the problem of minimizing \(L_c\) over \(\mathbf{x}\). We indicate the Hessian with respect to only the variables \(\mathbf{x}\) as \(\nabla^2_{\mathbf{x},\mathbf{x}}\).

Recall from the proof of the Lagrange Multipliers: First-Order Necessary Conditions that

\[ \nabla \|\mathbf{h}(\mathbf{x})\|^2 = 2 \mathbf{J}_{\mathbf{h}}(\mathbf{x})^T \mathbf{h}(\mathbf{x}). \]

To compute the Hessian of that function, we note

\[\begin{align*} \frac{\partial}{\partial x_i}\left( \frac{\partial}{\partial x_j} \|\mathbf{h}(\mathbf{x})\|^2\right) &= \frac{\partial}{\partial x_i}\left( \sum_{k=1}^\ell 2 h_k(\mathbf{x}) \frac{\partial h_k(\mathbf{x})}{\partial x_j}\right)\\ &= 2 \sum_{k=1}^\ell\left( \frac{\partial h_k(\mathbf{x})}{\partial x_i} \frac{\partial h_k(\mathbf{x})}{\partial x_j} + h_k(\mathbf{x}) \frac{\partial^2 h_k(\mathbf{x})}{\partial x_i \partial x_j} \right)\\ &= 2 \left(\mathbf{J}_{\mathbf{h}}(\mathbf{x})^T \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}) + \sum_{k=1}^\ell h_k(\mathbf{x}) \, \mathbf{H}_{h_k}(\mathbf{x})\right)_{i,j}. \end{align*}\]

Hence

\[ \nabla_{\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) = \nabla f(\mathbf{x}^*) + c \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{h}(\mathbf{x}^*) + \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \blambda^* \]

and

\[ \nabla^2_{\mathbf{x},\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) = \mathbf{H}_{f}(\mathbf{x}^*) + c \left(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) + \sum_{k=1}^\ell h_k(\mathbf{x}^*) \, \mathbf{H}_{h_k}(\mathbf{x}^*)\right) + \sum_{k=1}^\ell \lambda^*_k \, \mathbf{H}_{h_k}(\mathbf{x}^*). \]

By the assumptions of the theorem, this simplifies to

\[ \nabla_{\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) = \mathbf{0} \]

and

\[ \nabla^2_{\mathbf{x},\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) =\underbrace{\left\{ \mathbf{H}_{f}(\mathbf{x}^*) + \sum_{k=1}^\ell \lambda^*_k \, \mathbf{H}_{h_k}(\mathbf{x}^*) \right\}}_{P} + c \underbrace{\left\{\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \right\}}_{Q}, \]

where further \(\mathbf{v}^T P \mathbf{v} > 0\) for any \(\mathbf{v} \neq \mathbf{0}\) such that \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \,\mathbf{v} = \mathbf{0}\) (which itself implies \(\mathbf{v}^T Q \mathbf{v} = \mathbf{0}\)). Furthermore, \(Q \succeq \mathbf{0}\) since

\[ \mathbf{w}^T Q \mathbf{w} = \mathbf{w}^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \mathbf{w} = \left\|\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \mathbf{w}\right\|^2 \geq 0 \]

for any \(\mathbf{w} \in \mathbb{R}^d\). The previous lemma allows us to take \(c\) large enough that \(\nabla^2_{\mathbf{x},\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) \succ \mathbf{0}\).

As a result, the unconstrained Second-Order Sufficient Conditions are satisfied at \(\mathbf{x}^*\) for the problem

\[ \min_{\mathbf{x} \in \mathbb{R}^d} L_c(\mathbf{x}, \blambda^*). \]

That is, there is \(\delta > 0\) such that

\[ L_c(\mathbf{x}^*, \blambda^*) < L_c(\mathbf{x}, \blambda^*), \qquad \forall \mathbf{x} \in B_{\delta}(\mathbf{x}^*) \setminus \{\mathbf{x}^*\}. \]

Restricting this to the feasible vectors of the modified constrained problem (i.e., those such that \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\)) implies after simplification

\[ f(\mathbf{x}^*) < f(\mathbf{x}), \qquad \forall \mathbf{x} \in \{\mathbf{x} : \mathbf{h}(\mathbf{x}) = \mathbf{0}\} \cap (B_{\delta}(\mathbf{x}^*) \setminus \{\mathbf{x}^*\}). \]

Therefore, \(\mathbf{x}^*\) is a strict local minimizer of the modified constrained problem (and, in turn, of the original constrained problem). That concludes the proof. \(\square\)