\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)
8.7. Online supplementary material#
8.7.1. Quizzes, solutions, code, etc.#
8.7.1.1. Just the code#
An interactive Jupyter notebook featuring the code in this chapter can be accessed below (Google Colab recommended). You are encouraged to tinker with it. Some suggested computational exercises are scattered throughout. The notebook is also available as a slideshow.
8.7.1.2. Self-assessment quizzes#
A more extensive web version of the self-assessment quizzes is available by following the links below.
8.7.1.3. Auto-quizzes#
Automatically generated quizzes for this chapter can be accessed here (Google Colab recommended).
8.7.1.4. Solutions to odd-numbered warm-up exercises#
(with help from Claude, Gemini, and ChatGPT)
E8.2.1 The vectorization is obtained by stacking the columns of \(A\): \(\text{vec}(A) = (2, 0, 1, -1)\).
E8.2.3
E8.2.5
E8.2.7 First, compute the Jacobian matrices of \(\mathbf{f}\) and \(\mathbf{g}\):
Then, by the Chain Rule,
E8.2.9 From E8.2.5, we have
So,
Now,
So,
We see that \((A \otimes B)^T = A^T \otimes B^T\), as expected from the properties of the Kronecker product.
E8.2.11
So,
E8.2.13 First, compute the gradient of \(f\) and the Jacobian matrix of \(\mathbf{g}\):
Then, by the Chain Rule,
E8.2.15 The Jacobian matrix of \(\mathbf{g}\) is
where \(f'(x) = \cos(x).\) So,
E8.3.1 Each entry of \(AB\) is the dot product of a row of \(A\) and a column of \(B\), which takes 2 multiplications and 1 addition. Since \(AB\) has 4 entries, the total number of operations is \(4 \times 3 = 12\).
E8.3.3 We have
so the partial derivatives are \(\frac{\partial \ell}{\partial \hat{y}_1} = 2 \hat{y}_1\) and \(\frac{\partial \ell}{\partial \hat{y}_2} = 2 \hat{y}_2\) and
E8.3.5 From E8.3.4, we have \(\mathbf{z}_1 = \mathbf{g}_0(\mathbf{z}_0) = (-1, -1)\). Then, \(\mathbf{z}_2 = \mathbf{g}_1(\mathbf{z}_1) = (1, -2)\) and \(f(\mathbf{x}) = \ell(\mathbf{z}_2) = 5\). By the Chain Rule,
E8.3.7 We have
so, by computing all partial derivatives,
Using the notation in the text, \(A_1 = W_1\) and \(B_1 = \mathbf{z}_1^T = I_{1 \times 1} \otimes \mathbf{z}_1^T\).
E8.3.9 We have
so, by the Chain Rule, the partial derivatives are
Moreover, by E8.3.7,
By the fundamental recursion and the results in E8.3.3 and E8.3.8,
E8.4.1 The full gradient descent step is:
The expected SGD step with a batch size of 2 is:
which is equal to the full gradient descent step, as proven in the “Expected SGD Step” lemma.
E8.4.3
E8.4.5 The SGD update is given by
Plugging in the values, we get
E8.4.7
We used the fact that \(\sigma'(z) = \sigma(z)(1-\sigma(z))\).
E8.4.9 First, we compute \(z_1 = Wx = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} (1, 2) = (0, 0, 0)\). Then, \(\hat{y} = \gamma(z_1) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\). From the text, we have:
E8.4.11 First, we compute the individual gradients:
Then, the full gradient is:
E8.4.13 The cross-entropy loss is given by
E8.5.1 \(\sigma(1) = \frac{1}{1 + e^{-1}} \approx 0.73\) \(\sigma(-1) = \frac{1}{1 + e^{1}} \approx 0.27\) \(\sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.88\)
E8.5.3 \(\bsigma(\mathbf{z}) = (\bsigma(1), \bsigma(-1), \bsigma(2)) \approx (0.73, 0.27, 0.88)\) \(\bsigma'(\mathbf{z}) = (\bsigma'(1), \bsigma'(-1), \bsigma'(2)) \approx (0.20, 0.20, 0.10)\)
E8.5.5 \(W\mathbf{x} = \begin{pmatrix} -1 \\ 4 \end{pmatrix}\), so \(\sigma(W\mathbf{x}) \approx (0.27, 0.98)\), \(J_\bsigma(W\mathbf{x}) = \mathrm{diag}(\bsigma(W\mathbf{x}) \odot (1 - \bsigma(W\mathbf{x}))) \approx \begin{pmatrix} 0.20 & 0 \\ 0 & 0.02 \end{pmatrix}\)
E8.5.7 \(\nabla H(\mathbf{y}, \mathbf{z}) = (-\frac{y_1}{z_1}, -\frac{y_2}{z_2}) = (-\frac{0}{0.3}, -\frac{1}{0.7}) \approx (0, -1.43)\)
8.7.1.5. Learning outcomes#
Define the Jacobian matrix and use it to compute the differential of a vector-valued function.
State and apply the generalized Chain Rule to compute the Jacobian of a composition of functions.
Perform calculations with the Hadamard and Kronecker products of matrices.
Describe the purpose of automatic differentiation and its advantages over symbolic and numerical differentiation.
Implement automatic differentiation in PyTorch to compute gradients of vector-valued functions.
Derive the Chain Rule for multi-layer progressive functions and apply it to compute gradients.
Compare and contrast the forward and reverse modes of automatic differentiation in terms of computational complexity.
Define progressive functions and identify their key properties.
Derive the fundamental recursion for the Jacobian of a progressive function.
Implement the backpropagation algorithm to efficiently compute gradients of progressive functions.
Analyze the computational complexity of the backpropagation algorithm in terms of the number of matrix-vector products.
Describe the stochastic gradient descent (SGD) algorithm and explain how it differs from standard gradient descent.
Derive the update rule for stochastic gradient descent from the gradient of the loss function.
Prove that the expected SGD step is equal to the full gradient descent step.
Evaluate the performance of models trained using stochastic gradient descent on real-world datasets.
Define the multilayer perceptron (MLP) architecture and describe the role of affine maps and activation functions in each layer.
Compute the Jacobian of the sigmoid activation function using properties of diagonal matrices and Kronecker products.
Apply the chain rule to calculate the gradient of the loss function with respect to the weights in a small MLP example.
Generalize the gradient computation for an MLP with an arbitrary number of layers using a forward and backward pass.
Implement the training of a neural network using PyTorch.
\(\aleph\)
8.7.2. Additional sections#
8.7.2.1. Another example: linear regression#
We give another concrete example of progressive functions and of the application of backpropagration and stochastic gradient descent.
Computing the gradient While we have motivated the framework introduced in the previous section from the point of view of classification, it also immediately applies to the regression setting. Both classification and regression are instances of supervised learning.
We first compute the gradient of a single sample. Here \(\mathbf{x} \in \mathbb{R}^d\) again, but \(y\) is a real-valued outcome variable. We revisit the case of linear regression where the loss function is
and the regression function only has input and output layers and no hidden layer (that is, \(L=0\)) with
where \(\mathbf{w} \in \mathbb{R}^{d}\) are the parameters. Recall that we can include a constant term (one that does not depend on the input) by adding a \(1\) to the input. To keep the notation simple, we assume that this pre-processing has already been performed if desired.
Finally, the objective function for a single sample is
Using the notation from the previous subsection, the forward pass in this case is:
Initialization:
Forward layer loop:
where \(\mathbf{w}_0 := \mathbf{w}\).
Loss:
The backward pass is:
Backward layer loop:
Output:
As we noted before, there is in fact no need to compute \(A_0\) and \(\mathbf{p}_0\).
The Advertising
dataset and the least-squares solution We return to the Advertising
dataset.
data = pd.read_csv('advertising.csv')
data.head()
Unnamed: 0 | TV | radio | newspaper | sales | |
---|---|---|---|---|---|
0 | 1 | 230.1 | 37.8 | 69.2 | 22.1 |
1 | 2 | 44.5 | 39.3 | 45.1 | 10.4 |
2 | 3 | 17.2 | 45.9 | 69.3 | 9.3 |
3 | 4 | 151.5 | 41.3 | 58.5 | 18.5 |
4 | 5 | 180.8 | 10.8 | 58.4 | 12.9 |
n = len(data.index)
print(n)
200
We first compute the solution using the least-squares approach we detailed previously. We use numpy.column_stack
to add a column of ones to the feature vectors.
TV = data['TV'].to_numpy()
radio = data['radio'].to_numpy()
newspaper = data['newspaper'].to_numpy()
sales = data['sales'].to_numpy()
features = np.stack((TV, radio, newspaper), axis=-1)
A = np.column_stack((np.ones(n), features))
coeff = mmids.ls_by_qr(A, sales)
print(coeff)
[ 2.93888937e+00 4.57646455e-02 1.88530017e-01 -1.03749304e-03]
The MSE is:
np.mean((A @ coeff - sales)**2)
2.7841263145109365
Solving the problem using PyTorch We will be using PyTorch to implement the previous method. We first convert the data into PyTorch tensors. We then use torch.utils.data.TensorDataset
to create the dataset. Finally, torch.utils.data.DataLoader
provides the utilities to load the data in batches for training. We take mini-batches of size BATCH_SIZE = 64
and we apply a random permutation of the samples on every pass (with the option shuffle=True
).
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim
features_tensor = torch.tensor(features, dtype=torch.float32)
sales_tensor = torch.tensor(sales, dtype=torch.float32).view(-1, 1)
BATCH_SIZE = 64
train_dataset = TensorDataset(features_tensor, sales_tensor)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
Now we construct our model. It is simply an affine map from \(\mathbb{R}^3\) to \(\mathbb{R}\). Note that there is no need to pre-process the inputs by adding \(1\)s. A constant term (or “bias variable”) is automatically added by PyTorch (unless one chooses the option bias=False
).
model = nn.Sequential(
nn.Linear(3, 1) # 3 input features, 1 output value
)
Finally, we are ready to run an optimization method of our choice on the loss function, which are specified next. There are many optimizers available. (See this post for a brief explanation of many common optimizers.) Here we use SGD as the optimizer. And the loss function is the MSE. A quick tutorial is here.
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-5)
Choosing the right number of passes (i.e. epochs) through the data requires some experimenting. Here \(10^4\) suffices. But in the interest of time, we will run it only for \(100\) epochs. As you will see from the results, this is not quite enough. On each pass, we compute the output of the current model, use backward()
to obtain the gradient, and then perform a descent update with step()
. We also have to reset the gradients first (otherwise they add up by default).
epochs = 100
for epoch in range(epochs):
for inputs, targets in train_loader:
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
if (epoch+1) % 10 == 0:
print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")
The final parameters and loss are:
Show code cell source
weights = model[0].weight.detach().numpy()
bias = model[0].bias.detach().numpy()
print("Weights:", weights)
print("Bias:", bias)
Weights: [[0.06328993 0.10348311 0.08792502]]
Bias: [-0.4447368]
Show code cell source
# Evaluate the model
model.eval()
with torch.no_grad():
total_loss = 0
for inputs, targets in train_loader:
outputs = model(inputs)
loss = criterion(outputs, targets)
total_loss += loss.item()
print(f"Mean Squared Error on Training Set: {total_loss / len(train_loader)}")
Mean Squared Error on Training Set: 8.009695529937744
8.7.2.2. Convolutional neural networks#
One can do even better using a neural network tailored for images, known as convolutional neural networks. From Wikipedia:
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.
More background can be found in this excellent module from Stanford’s CS231n. Our CNN will be a composition of convolutional layers and pooling layers.
CHAT & LEARN Convolutional neural networks (CNNs) are powerful for image classification. Ask your favorite AI chatbot to explain the basic concepts of CNNs, including convolutional layers and pooling layers. \(\ddagger\)
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
train_dataset = datasets.FashionMNIST(root='./data', train=True,
download=True, transform=transforms.ToTensor())
test_dataset = datasets.FashionMNIST(root='./data', train=False,
download=True, transform=transforms.ToTensor())
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
device = torch.device("cuda" if torch.cuda.is_available()
else ("mps" if torch.backends.mps.is_available()
else "cpu"))
print("Using device:", device)
Using device: mps
The new model is the following.
model = nn.Sequential(
# First convolution, operating upon a 28x28 image
nn.Conv2d(1, 16, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
# Second convolution, operating upon a 14x14 image
nn.Conv2d(16, 32, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
# Third convolution, operating upon a 7x7 image
nn.Conv2d(32, 32, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
# Flatten the tensor
nn.Flatten(),
# Fully connected layer
nn.Linear(32 * 3 * 3, 10),
).to(device)
We train and test.
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)
Epoch 1/3
Epoch 2/3
Epoch 3/3
mmids.test(test_loader, model, loss_fn, device)
Test error: 88.6% accuracy
Note the high accuracy.
Finally, we try the Fashion-MNIST dataset. We use the same CNN.
train_dataset = datasets.FashionMNIST(root='./data', train=True,
download=True, transform=transforms.ToTensor())
test_dataset = datasets.FashionMNIST(root='./data', train=False,
download=True, transform=transforms.ToTensor())
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)
Epoch 1/3
Epoch 2/3
Epoch 3/3
mmids.test(test_loader, model, loss_fn, device)
Test error: 90.0% accuracy
The accuracy is not as high, as this is a more difficult dataset.
8.7.3. Additional proofs#
Proofs of Lagrange Multipliers Conditions We first prove the Lagrange Multipliers: First-Order Necessary Conditions. We follow the excellent textbook [Ber, Section 4.1]. The proof uses the concept of the Jacobian.
Proof idea: We reduce the problem to an unconstrained optimization problem by penalizing the constraint in the objective function. We then apply the unconstrained First-Order Necessary Conditions.
Proof: (Lagrange Multipliers: First-Order Necessary Conditions) We reduce the problem to an unconstrained optimization problem by penalizing the constraint in the objective function. We also add a regularization term to ensure that \(\mathbf{x}^*\) is the unique local minimizer in a neighborhood. Specifically, for each non-negative integer \(k\), consider the objective function
for some positive constant \(\alpha > 0\). Note that as \(k\) gets larger, the penalty becomes more significant and, therefore, enforcing the constraint becomes more desirable. The proof proceeds in several steps.
We first consider a version minimizing \(F^k\) constrained to lie in a neighborhood of \(\mathbf{x}^*\). Because \(\mathbf{x}^*\) is a local minimizer of \(f\) subject to \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\), there is \(\delta > 0\) such that \(f(\mathbf{x}^*)\leq f(\mathbf{x})\) for all feasible \(\mathbf{x}\) within
LEMMA (Step I: Solving the Penalized Problem in a Neighborhood of \(\mathbf{x}^*\)) For \(k \geq 1\), let \(\mathbf{x}^k\) be a global minimizer of the minimization problem
a) The sequence \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) converges to \(\mathbf{x}^*\).
b) For \(k\) sufficiently large, \(\mathbf{x}^k\) is a local minimizer of the objective function \(F^k\) without any constraint.
\(\flat\)
Proof: The set \(\mathscr{X}\) is closed and bounded, and \(F^k\) is continuous. Hence, the sequence \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) is well-defined by the Extreme Value Theorem. Let \(\bar{\mathbf{x}}\) be any limit point of \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\). We show that \(\bar{\mathbf{x}} = \mathbf{x}^*\). That will imply a). It also implied b) since hen, for large enough \(k\), \(\mathbf{x}^k\) must be an interior point of \(\mathscr{X}\).
Let \(-\infty < m \leq M < + \infty\) be the smallest and largest values of \(f\) on \(\mathscr{X}\), which exist by the Extreme Value Theorem. Then, for all \(k\), by definition of \(\mathbf{x}^k\) and the fact that \(\mathbf{x}^*\) is feasible
Rearraning gives
So \(\lim_{k \to \infty} \|\mathbf{h}(\mathbf{x}^k)\|^2 = 0\), which, by the continuity of \(\mathbf{h}\) and of the Frobenius norm, implies that \(\|\mathbf{h}(\bar{\mathbf{x}})\|^2 = 0\), that is, \(\mathbf{h}(\bar{\mathbf{x}}) = \mathbf{0}\). In other words, any limit point \(\bar{\mathbf{x}}\) is feasible.
In addition to being feasible, \(\bar{\mathbf{x}} \in \mathscr{X}\) because that constraint set is closed. So, by the choice of \(\mathscr{X}\), we have \(f(\mathbf{x}^*) \leq f(\bar{\mathbf{x}})\). Furthermore, by \((*)\), we get
This is only possible if \(\|\bar{\mathbf{x}} - \mathbf{x}^*\|^2 = 0\) or, put differently, \(\bar{\mathbf{x}} = \mathbf{x}^*\). That proves the lemma. \(\square\)
LEMMA (Step II: Applying the Unconstrained Necessary Conditions) Let \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) be the sequence in the previous lemma.
a) For sufficiently large \(k\), the vectors \(\nabla h_i(\mathbf{x}^k)\), \(i=1,\ldots,\ell\), are linearly independent.
b) Let \(\mathbf{J}_{\mathbf{h}}(\mathbf{x})\) be the Jacobian matrix of \(\mathbf{h}\), that is, the matrix whose rows are the (row) vectors \(\nabla h_i(\mathbf{x})^T\), \(i=1,\ldots,\ell\). Then
where
\(\flat\)
Proof: By the previous lemma, for \(k\) large enough, \(\mathbf{x}^k\) is an unconstrained local minimizer of \(F^k\). So by the (unconstrained) First-Order Necessary Conditions, it holds that
To compute the gradient of \(F^k\) we note that
The partial derivatives are
by the Chain Rule. So, in vector form,
The term \(\|\mathbf{x} - \mathbf{x}^*\|^2\) can be rewritten as the quadratic function
Using a previous formula with \(P = 2 I_{d \times d}\) (which is symmetric), \(\mathbf{q} = -2 \mathbf{x}^*\) and \(r = (\mathbf{x}^*)^T \mathbf{x}^*\), we get
So, putting everything together,
By the previous lemma, \(\mathbf{x}^k \to \mathbf{x}^*\), \(\nabla f(\mathbf{x}^k) \to \nabla f(\mathbf{x}^*)\), and \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \to \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\) as \(k \to +\infty\).
So it remains to derive the limit of \(k \mathbf{h}(\mathbf{x}^k)\). By assumption, the columns of \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T\) are linearly independent. That implies that for any unit vector \(\mathbf{z} \in \mathbb{R}^\ell\)
otherwise we would have \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z} = \mathbf{0}\), contradicting the linear independence assumption. By the Extreme Value Theorem, there is \(\beta > 0\) such that
for all unit vectors \(\mathbf{z} \in \mathbb{R}^\ell\). Since \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \to \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\), it follows from a previous lemma that, for \(k\) large enough and any unit vector \(\mathbf{z} \in \mathbb{R}^\ell\),
That implies
so by the same argument as above the columns of \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T\) are linearly independent for \(k\) large enough and \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T\) is invertible. That proves a).
Going back to \((**)\), multiplying both sides by \((\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)\, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T)^{-1} \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)\), and taking a limit \(k \to +\infty\), we get after rearranging that
Plugging back we get
as claimed. That proves b). \(\square\)
Combining the lemmas establishes the theorem. \(\square\)
Next, we prove the Lagrange Multipliers: Second-Order Sufficient Conditions. Again, we follow [Ber, Section 4.2]. We will need the following lemma. The proof can be skipped.
Proof idea: We consider a slightly modified problem
which has the same local minimizers. Applying the Second-Order Sufficient Conditions to the Lagrangian of the modified problem gives the result when \(c\) is chosen large enough.
LEMMA Let \(P\) and \(Q\) be symmetric matrices in \(\mathbb{R}^{n \times n}\). Assume that \(Q\) is positive semidefinite and that \(P\) is positive definite on the null space of \(Q\), that is, \(\mathbf{w}^T P \mathbf{w} > 0\) for all \(\mathbf{w} \neq \mathbf{0}\) such that \(\mathbf{w}^T Q \mathbf{w} = \mathbf{0}\). Then there is a scalar \(\bar{c} \geq 0\) such that \(P + c Q\) is positive definite for all \(c > \bar{c}\). \(\flat\)
Proof: (lemma) We argue by contradiction. Suppose there is an non-negative, increasing, diverging sequence \(\{c_k\}_{k=1}^{+\infty}\) and a sequence of unit vectors \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) such that
for all \(k\). Because the sequence is bounded, it has a limit point \(\bar{\mathbf{x}}\). Assume without loss of generality that \(\mathbf{x}^k \to \bar{\mathbf{x}}\), as \(k \to \infty\). Since \(c_k \to +\infty\) and \((\mathbf{x}^k)^T Q \mathbf{x}^k \geq 0\) by assumption, we must have \((\bar{\mathbf{x}})^T Q \bar{\mathbf{x}} = 0\), otherwise \((\mathbf{x}^k)^T (P + c_k Q) \mathbf{x}^k\) would diverge. Hence, by the assumption in the statement, it must be that \((\bar{\mathbf{x}})^T P \bar{\mathbf{x}} > 0\). This contradicts the inequality in the display above. \(\square\)
Proof: (Lagrange Multipliers: Second-Order Sufficient Conditions) We consider the modified problem
It has the same local minimizers as the orginal problem as the additional term in the objective is zero for feasible vectors. That extra term will allow us to use the previous lemma. For notational convenience, we define
The Lagrangian of the modified problem is
We will apply the Second-Order Sufficient Conditions to problem of minimizing \(L_c\) over \(\mathbf{x}\). We indicate the Hessian with respect to only the variables \(\mathbf{x}\) as \(\nabla^2_{\mathbf{x},\mathbf{x}}\).
Recall from the proof of the Lagrange Multipliers: First-Order Necessary Conditions that
To compute the Hessian of that function, we note
Hence
and
By the assumptions of the theorem, this simplifies to
and
where further \(\mathbf{v}^T P \mathbf{v} > 0\) for any \(\mathbf{v}\) such that \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \,\mathbf{v} = \mathbf{0}\) (which itself implies \(\mathbf{v}^T Q \mathbf{v} = \mathbf{0}\)). Furthermore, \(Q \succeq \mathbf{0}\) since
for any \(\mathbf{w} \in \mathbb{R}^d\). The previous lemma allows us to take \(c\) large enough that \(\nabla^2_{\mathbf{x},\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) \succ \mathbf{0}\).
As a result, the unconstrained Second-Order Sufficient Conditions are satisfied at \(\mathbf{x}^*\) for the problem
That is, there is \(\delta > 0\) such that
Restricting this to the feasible vectors of the modified constrained problem (i.e., those such that \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\)) implies after simplification
Therefore, \(\mathbf{x}^*\) is a strict local minimizer of the modified constrained problem (and, in turn, of the original constrained problem). That concludes the proof. \(\square\)