\(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\)
6.6. More advanced material#
In this section, we cover some more advanced material.
6.6.1. Additional proofs#
Proof of Second-Order Sufficient Conditions We prove the Second-Order Sufficient Conditions.
We will need a simple lemma.
LEMMA Let \(A = (a_{i,j})_{i,j}\) and \(B = (b_{i,j})_{i,j}\) be matrices in \(\mathbb{R}^{n \times m}\). For any unit vectors \(\mathbf{u} \in \mathbb{R}^n\) and \(\mathbf{v} \in \mathbb{R}^m\)
\[
\left|\mathbf{u}^T A \mathbf{v} - \mathbf{u}^T B \mathbf{v}\right| \leq \|A - B\|_F.
\]
\(\flat\)
Proof: By Cauchy-Schwarz,
\[
\left|\mathbf{u}^T A \mathbf{v} - \mathbf{u}^T B \mathbf{v}\right| = \left|\mathbf{u}^T (A - B)\, \mathbf{v}\right| \leq \|\mathbf{u}\|\, \|(A - B)\,\mathbf{v}\| \leq \|\mathbf{u}\|\, \|A - B\|_F\, \|\mathbf{v}\| = \|A - B\|_F,
\]
where we used that \(\mathbf{u}\) and \(\mathbf{v}\) have unit norm in the last step. \(\square\)
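As a quick numerical sanity check of this bound, the sketch below (using NumPy; the matrices and unit vectors are arbitrary choices, not from the text) compares \(|\mathbf{u}^T A \mathbf{v} - \mathbf{u}^T B \mathbf{v}|\) to \(\|A - B\|_F\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 4
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, m))

# Random unit vectors u in R^n and v in R^m.
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
v = rng.standard_normal(m)
v /= np.linalg.norm(v)

# The lemma bounds |u^T A v - u^T B v| by the Frobenius norm of A - B.
lhs = abs(u @ A @ v - u @ B @ v)
rhs = np.linalg.norm(A - B, ord="fro")
assert lhs <= rhs
```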
Proof: By Taylor’s Theorem, for all unit vectors \(\mathbf{v} \in \mathbb{R}^d\) and \(\alpha \in \mathbb{R}\), there is \(\xi_{\alpha} \in (0,1)\) such that
\[
f(\mathbf{x}_0 + \alpha \mathbf{v}) = f(\mathbf{x}_0) + \alpha \nabla f(\mathbf{x}_0)^T \mathbf{v} + \frac{1}{2} \alpha^2\, \mathbf{v}^T \mathbf{H}_f(\mathbf{x}_0 + \xi_\alpha \alpha \mathbf{v}) \,\mathbf{v} = f(\mathbf{x}_0) + \frac{1}{2} \alpha^2\, \mathbf{v}^T \mathbf{H}_f(\mathbf{x}_0 + \xi_\alpha \alpha \mathbf{v}) \,\mathbf{v},
\]
where we used that \(\nabla f(\mathbf{x}_0) = \mathbf{0}\). The second term on the last line is \(0\) at \(\mathbf{v} = \mathbf{0}\). Our goal is to show that it is strictly positive (except at \(\mathbf{0}\)) in a neighborhood of \(\mathbf{0}\).
The set \(\mathbb{S}^{d-1}\) of unit vectors in \(\mathbb{R}^d\) is closed and bounded. The expression \(\mathbf{v}^T \mathbf{H}_f(\mathbf{x}_0) \,\mathbf{v}\), viewed as a function of \(\mathbf{v}\), is continuous since it is a polynomial. Hence, by the Extreme Value Theorem, it attains its minimum on \(\mathbb{S}^{d-1}\). By our assumption that \(\mathbf{H}_f(\mathbf{x}_0)\) is positive definite, that minimum must be strictly positive, say \(\mu > 0\).
By the previous lemma (ignoring the absolute value),
The Frobenius norm above is continuous in \(\mathbf{w}\) as a composition of continuous functions. Moreover, we of course have at \(\mathbf{w} = \mathbf{0}\) that this Frobenius norm is \(0\). Hence, by definition of continuity, for any \(\epsilon > 0\) – say \(\epsilon := \mu/2\) – there is \(\delta > 0\) such that \(\mathbf{w} \in B_{\delta}(\mathbf{0})\) implies \(\|\mathbf{H}_f(\mathbf{x}_0) - \mathbf{H}_f(\mathbf{x}_0 + \mathbf{w})\|_F < \epsilon = \mu/2\).
Since \(\mathbf{v}^T \mathbf{H}_f(\mathbf{x}_0) \,\mathbf{v} \geq \mu\), the inequality in the previous display implies that
This holds for any unit vector \(\mathbf{v}\) and any \(\mathbf{w} \in B_{\delta}(\mathbf{0})\).
Going back to our Taylor expansion, for \(\alpha > 0\) small enough (not depending on \(\mathbf{v}\)), it holds that \(\mathbf{w} = \xi_\alpha \alpha \mathbf{v} \in B_{\delta}(\mathbf{0})\) so that we get from the previous inequality
Therefore \(\mathbf{x}_0\) is a strict local minimizer. \(\square\)
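To make the key quantity \(\mu\) concrete, here is a small numerical illustration (my own toy Hessian, not from the text): for a positive definite Hessian, the minimum of \(\mathbf{v}^T \mathbf{H}_f(\mathbf{x}_0)\,\mathbf{v}\) over unit vectors is its smallest eigenvalue, which is strictly positive.

```python
import numpy as np

# Toy positive definite Hessian, e.g. for f(x, y) = x^2 + 2 y^2 at the origin.
H = np.diag([2.0, 4.0])

# mu := min over unit vectors v of v^T H v equals the smallest eigenvalue of H.
mu = np.linalg.eigvalsh(H).min()
assert mu > 0

# Spot-check: every sampled unit vector satisfies v^T H v >= mu.
rng = np.random.default_rng(1)
for _ in range(1000):
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)
    assert v @ H @ v >= mu - 1e-12
```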
Proofs of Lagrange Multipliers Conditions We first prove the Lagrange Multipliers: First-Order Necessary Conditions. We follow the excellent textbook [Ber, Section 4.1].
Proof: (Lagrange Multipliers: First-Order Necessary Conditions) We reduce the problem to an unconstrained optimization problem by penalizing the constraint in the objective function. We also add a regularization term to ensure that \(\mathbf{x}^*\) is the unique local minimizer in a neighborhood. Specifically, for each non-negative integer \(k\), consider the objective function
\[
F^k(\mathbf{x}) = f(\mathbf{x}) + \frac{k}{2} \|\mathbf{h}(\mathbf{x})\|^2 + \frac{\alpha}{2} \|\mathbf{x} - \mathbf{x}^*\|^2,
\]
for some positive constant \(\alpha > 0\). Note that as \(k\) gets larger, the penalty becomes more significant and, therefore, enforcing the constraint becomes more desirable. The proof proceeds in several steps.
We first consider a version minimizing \(F^k\) constrained to lie in a neighborhood of \(\mathbf{x}^*\). Because \(\mathbf{x}^*\) is a local minimizer of \(f\) subject to \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\), there is \(\delta > 0\) such that \(f(\mathbf{x}^*)\leq f(\mathbf{x})\) for all feasible \(\mathbf{x}\) within the closed ball \(\mathscr{X} := \{\mathbf{x} : \|\mathbf{x} - \mathbf{x}^*\| \leq \delta\}\).
LEMMA (Step I: Solving the Penalized Problem in a Neighborhood of \(\mathbf{x}^*\)) For \(k \geq 1\), let \(\mathbf{x}^k\) be a global minimizer of the minimization problem
\[
\min_{\mathbf{x} \in \mathscr{X}} F^k(\mathbf{x}).
\]
a) The sequence \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) converges to \(\mathbf{x}^*\).
b) For \(k\) sufficiently large, \(\mathbf{x}^k\) is a local minimizer of the objective function \(F^k\) without any constraint.
\(\flat\)
Proof: The set \(\mathscr{X}\) is closed and bounded, and \(F^k\) is continuous. Hence, the sequence \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) is well-defined by the Extreme Value Theorem. Let \(\bar{\mathbf{x}}\) be any limit point of \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\). We show that \(\bar{\mathbf{x}} = \mathbf{x}^*\). That will imply a). It also implies b) since then, for large enough \(k\), \(\mathbf{x}^k\) must be an interior point of \(\mathscr{X}\).
Let \(-\infty < m \leq M < + \infty\) be the smallest and largest values of \(f\) on \(\mathscr{X}\), which exist by the Extreme Value Theorem. Then, for all \(k\), by definition of \(\mathbf{x}^k\) and the fact that \(\mathbf{x}^*\) is feasible
Rearranging gives
So \(\lim_{k \to \infty} \|\mathbf{h}(\mathbf{x}^k)\|^2 = 0\), which, by the continuity of \(\mathbf{h}\) and of the Euclidean norm, implies that \(\|\mathbf{h}(\bar{\mathbf{x}})\|^2 = 0\), that is, \(\mathbf{h}(\bar{\mathbf{x}}) = \mathbf{0}\). In other words, any limit point \(\bar{\mathbf{x}}\) is feasible.
In addition to being feasible, \(\bar{\mathbf{x}} \in \mathscr{X}\) because that constraint set is closed. So, by the choice of \(\mathscr{X}\), we have \(f(\mathbf{x}^*) \leq f(\bar{\mathbf{x}})\). Furthermore, by \((*)\), we get
This is only possible if \(\|\bar{\mathbf{x}} - \mathbf{x}^*\|^2 = 0\) or, put differently, \(\bar{\mathbf{x}} = \mathbf{x}^*\). That proves the lemma. \(\square\)
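The penalty construction can be illustrated numerically. The sketch below (my own toy instance, solved with `scipy.optimize.minimize`) minimizes the penalized objective \(F^k\) for increasing \(k\) and checks that the minimizers approach \(\mathbf{x}^*\):

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = x1 + x2 subject to h(x) = x1^2 + x2^2 - 2 = 0.
# The constrained minimizer is x* = (-1, -1).
x_star = np.array([-1.0, -1.0])
alpha = 1.0

def F(x, k):
    """Penalized objective F^k(x) = f(x) + (k/2) h(x)^2 + (alpha/2) ||x - x*||^2."""
    f = x[0] + x[1]
    h = x[0] ** 2 + x[1] ** 2 - 2.0
    return f + 0.5 * k * h ** 2 + 0.5 * alpha * np.sum((x - x_star) ** 2)

# Minimize F^k for increasing k, warm-starting from the previous solution.
xk = np.zeros(2)
for k in [1, 10, 100, 1000, 10000]:
    xk = minimize(F, xk, args=(k,)).x

# The unconstrained minimizers of F^k converge to x*.
assert np.linalg.norm(xk - x_star) < 1e-2
```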
LEMMA (Step II: Applying the Unconstrained Necessary Conditions) Let \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) be the sequence in the previous lemma.
a) For sufficiently large \(k\), the vectors \(\nabla h_i(\mathbf{x}^k)\), \(i=1,\ldots,\ell\), are linearly independent.
b) Let \(\mathbf{J}_{\mathbf{h}}(\mathbf{x})\) be the Jacobian matrix of \(\mathbf{h}\), that is, the matrix whose rows are the (row) vectors \(\nabla h_i(\mathbf{x})^T\), \(i=1,\ldots,\ell\). Then
where
\(\flat\)
Proof: By the previous lemma, for \(k\) large enough, \(\mathbf{x}^k\) is an unconstrained local minimizer of \(F^k\). So by the (unconstrained) First-Order Necessary Conditions, it holds that
\[
\nabla F^k(\mathbf{x}^k) = \mathbf{0}.
\]
To compute the gradient of \(F^k\) we note that
The partial derivatives are
by the Chain Rule. So, in vector form,
The term \(\|\mathbf{x} - \mathbf{x}^*\|^2\) can be rewritten as the quadratic function
Using a previous formula with \(P = 2 I_{d \times d}\) (which is symmetric), \(\mathbf{q} = -2 \mathbf{x}^*\) and \(r = (\mathbf{x}^*)^T \mathbf{x}^*\), we get
So, putting everything together,
By the previous lemma, \(\mathbf{x}^k \to \mathbf{x}^*\), \(\nabla f(\mathbf{x}^k) \to \nabla f(\mathbf{x}^*)\), and \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \to \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\) as \(k \to +\infty\).
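The quadratic rewriting of \(\|\mathbf{x} - \mathbf{x}^*\|^2\) used above is easy to verify numerically (a sketch with arbitrary random vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
x_star = rng.standard_normal(d)
x = rng.standard_normal(d)

# ||x - x*||^2 written as (1/2) x^T P x + q^T x + r
# with P = 2 I, q = -2 x*, r = x*^T x*.
P = 2.0 * np.eye(d)
q = -2.0 * x_star
r = x_star @ x_star

val = 0.5 * x @ P @ x + q @ x + r
assert np.isclose(val, np.sum((x - x_star) ** 2))

# The gradient P x + q equals 2 (x - x*).
assert np.allclose(P @ x + q, 2.0 * (x - x_star))
```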
So it remains to derive the limit of \(k \mathbf{h}(\mathbf{x}^k)\). By assumption, the columns of \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T\) are linearly independent. That implies that for any unit vector \(\mathbf{z} \in \mathbb{R}^\ell\)
\[
\|\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z}\| > 0,
\]
otherwise we would have \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z} = \mathbf{0}\), contradicting the linear independence assumption. By the Extreme Value Theorem, there is \(\beta > 0\) such that
\[
\|\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{z}\| \geq \beta
\]
for all unit vectors \(\mathbf{z} \in \mathbb{R}^\ell\). Since \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \to \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\), it follows from a previous lemma that, for \(k\) large enough and any unit vector \(\mathbf{z} \in \mathbb{R}^\ell\),
That implies
so by the same argument as above the columns of \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T\) are linearly independent for \(k\) large enough and \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k) \,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T\) is invertible. That proves a).
Going back to \((**)\), multiplying both sides by \((\mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)\, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)^T)^{-1} \mathbf{J}_{\mathbf{h}}(\mathbf{x}^k)\), and taking a limit \(k \to +\infty\), we get after rearranging that
Plugging back we get
as claimed. That proves b). \(\square\)
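On a concrete instance, the resulting multipliers \(\blambda^* = -(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\,\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T)^{-1}\, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\,\nabla f(\mathbf{x}^*)\) can be checked directly against the first-order condition (my own toy example, not from the text):

```python
import numpy as np

# Toy problem: minimize f(x) = x1 + x2 subject to h(x) = x1^2 + x2^2 - 2 = 0,
# whose constrained minimizer is x* = (-1, -1).
x_star = np.array([-1.0, -1.0])
grad_f = np.array([1.0, 1.0])      # gradient of f (constant here)
J = np.atleast_2d(2.0 * x_star)    # Jacobian of h at x*: the row (2 x1, 2 x2)

# Candidate multipliers: lambda* = -(J J^T)^{-1} J grad f(x*).
lam = -np.linalg.solve(J @ J.T, J @ grad_f)

# They satisfy the first-order condition grad f(x*) + J^T lambda* = 0.
assert np.allclose(grad_f + J.T @ lam, 0.0)
assert np.isclose(lam[0], 0.5)
```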
Combining the lemmas establishes the theorem. \(\square\)
Next, we prove the Lagrange Multipliers: Second-Order Sufficient Conditions. Again, we follow [Ber, Section 4.2]. We will need the following lemma. The proof can be skipped.
LEMMA Let \(P\) and \(Q\) be symmetric matrices in \(\mathbb{R}^{n \times n}\). Assume that \(Q\) is positive semidefinite and that \(P\) is positive definite on the null space of \(Q\), that is, \(\mathbf{w}^T P \mathbf{w} > 0\) for all \(\mathbf{w} \neq \mathbf{0}\) such that \(\mathbf{w}^T Q \mathbf{w} = \mathbf{0}\). Then there is a scalar \(\bar{c} \geq 0\) such that \(P + c Q\) is positive definite for all \(c > \bar{c}\). \(\flat\)
Proof: (lemma) We argue by contradiction. Suppose there is a non-negative, increasing, diverging sequence \(\{c_k\}_{k=1}^{+\infty}\) and a sequence of unit vectors \(\{\mathbf{x}^k\}_{k=1}^{+\infty}\) such that
for all \(k\). Because the sequence is bounded, it has a limit point \(\bar{\mathbf{x}}\). Assume without loss of generality that \(\mathbf{x}^k \to \bar{\mathbf{x}}\), as \(k \to \infty\). Since \(c_k \to +\infty\) and \((\mathbf{x}^k)^T Q \mathbf{x}^k \geq 0\) by assumption, we must have \((\bar{\mathbf{x}})^T Q \bar{\mathbf{x}} = 0\), otherwise \((\mathbf{x}^k)^T (P + c_k Q) \mathbf{x}^k \) would diverge. Hence, by the assumption in the statement, it must be that \((\bar{\mathbf{x}})^T P \bar{\mathbf{x}} > 0\). This contradicts the inequality in the display above. \(\square\)
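A small numerical instance of this lemma (my own toy matrices): \(P\) below is indefinite, yet positive definite on the null space of the positive semidefinite \(Q\), and \(P + cQ\) becomes positive definite once \(c\) is large enough.

```python
import numpy as np

# P is indefinite, but w^T P w = w1^2 > 0 on the null space of Q
# (vectors with w2 = 0); Q is positive semidefinite.
P = np.diag([1.0, -1.0])
Q = np.diag([0.0, 1.0])

assert np.all(np.linalg.eigvalsh(Q) >= 0)   # Q is PSD
assert np.linalg.eigvalsh(P).min() < 0      # P alone is not PD

# P + c Q = diag(1, c - 1) is positive definite for every c > 1.
for c in [2.0, 10.0, 100.0]:
    assert np.linalg.eigvalsh(P + c * Q).min() > 0
```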
Proof: (Lagrange Multipliers: Second-Order Sufficient Conditions) We consider the modified problem
It has the same local minimizers as the original problem, since the additional term in the objective is zero for feasible vectors. That extra term will allow us to use the previous lemma. For notational convenience, we define
The Lagrangian of the modified problem is
We will apply the Second-Order Sufficient Conditions to the problem of minimizing \(L_c\) over \(\mathbf{x}\). We denote the Hessian with respect to the variables \(\mathbf{x}\) only by \(\nabla^2_{\mathbf{x},\mathbf{x}}\).
Recall from the proof of the Lagrange Multipliers: First-Order Necessary Conditions that
To compute the Hessian of that function, we note
Hence
and
By the assumptions of the theorem, this simplifies to
and
where further \(\mathbf{v}^T P \mathbf{v} > 0\) for any \(\mathbf{v} \neq \mathbf{0}\) such that \(\mathbf{J}_{\mathbf{h}}(\mathbf{x}^*) \,\mathbf{v} = \mathbf{0}\) (which itself implies \(\mathbf{v}^T Q \mathbf{v} = 0\)). Furthermore, \(Q \succeq \mathbf{0}\) since
for any \(\mathbf{w} \in \mathbb{R}^d\). The previous lemma allows us to take \(c\) large enough that \(\nabla^2_{\mathbf{x},\mathbf{x}} L_c(\mathbf{x}^*, \blambda^*) \succ \mathbf{0}\).
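Here is a numerical sketch of this step on a toy instance (my own choice, not from the text): the Hessian \(P\) of the Lagrangian is indefinite, but positive definite on the tangent space \(\{\mathbf{v} : \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\,\mathbf{v} = \mathbf{0}\}\), and adding \(c\, \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)^T \mathbf{J}_{\mathbf{h}}(\mathbf{x}^*)\) for large \(c\) makes it positive definite.

```python
import numpy as np

# Toy instance: minimize f(x) = x1^2 - x2^2 subject to h(x) = x2 = 0,
# with x* = (0, 0) and lambda* = 0.
P = np.diag([2.0, -2.0])    # Hessian of the Lagrangian at (x*, lambda*)
J = np.array([[0.0, 1.0]])  # Jacobian of h at x*
Q = J.T @ J                 # Hessian contribution of (c/2) ||h(x)||^2 at x*

# P is indefinite, yet v^T P v = 2 v1^2 > 0 whenever J v = 0 and v != 0.
assert np.linalg.eigvalsh(P).min() < 0

# For c large enough (here c > 2), P + c Q is positive definite.
assert np.linalg.eigvalsh(P + 10.0 * Q).min() > 0
```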
As a result, the unconstrained Second-Order Sufficient Conditions are satisfied at \(\mathbf{x}^*\) for the problem
That is, there is \(\delta > 0\) such that
Restricting this to the feasible vectors of the modified constrained problem (i.e., those such that \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\)) implies after simplification
Therefore, \(\mathbf{x}^*\) is a strict local minimizer of the modified constrained problem (and, in turn, of the original constrained problem). That concludes the proof. \(\square\)