4.3. Variational characterization of eigenvalues#
In this section, we characterize eigenvalues in terms of certain optimization problems. Such variational characterizations are very useful in applications, as we will see later in the chapter. We first give a proof of the spectral theorem where this idea is used explicitly.
4.3.1. Proof of Spectral Theorem#
We will need the following block matrix formula. Suppose
\[
A = \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix},
\qquad
B = \begin{pmatrix} B_{11} & B_{12} \end{pmatrix},
\]
where \(A_{11} \in \mathbb{R}^{n_1\times p}\), \(A_{21} \in \mathbb{R}^{n_2\times p}\), \(B_{11} \in \mathbb{R}^{p\times n_1}\), \(B_{12} \in \mathbb{R}^{p\times n_2}\). Then
\[
A B = \begin{pmatrix} A_{11} B_{11} & A_{11} B_{12} \\ A_{21} B_{11} & A_{21} B_{12} \end{pmatrix}.
\]
Indeed, this is just a special case of the \(2 \times 2\) block matrix product formula we have encountered previously, with empty blocks \(A_{12}, A_{22}, B_{21}, B_{22}\).
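As a quick numerical sanity check, the following NumPy snippet (the dimensions and random entries are arbitrary choices for illustration) verifies the formula against a direct product:

```python
import numpy as np

# Check the block formula on random matrices with compatible dimensions.
rng = np.random.default_rng(0)
n1, n2, p = 3, 2, 4
A11, A21 = rng.normal(size=(n1, p)), rng.normal(size=(n2, p))
B11, B12 = rng.normal(size=(p, n1)), rng.normal(size=(p, n2))

A = np.vstack([A11, A21])  # stack the two blocks of A vertically
B = np.hstack([B11, B12])  # stack the two blocks of B horizontally
blockwise = np.block([[A11 @ B11, A11 @ B12],
                      [A21 @ B11, A21 @ B12]])
assert np.allclose(A @ B, blockwise)
```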
Proof idea (Spectral Theorem): Similarly to how we used Householder transformations to “add zeros under the diagonal”, here we will use a sequence of orthogonal transformations to add zeros both below and above the diagonal. Specifically, we construct a sequence of orthogonal matrices \(\hat{W}_1,\ldots, \hat{W}_d\) such that
\[
W_d^T \cdots W_1^T \, A \, W_1 \cdots W_d
\]
is diagonal, where each \(W_i\) is obtained by embedding \(\hat{W}_i\) into the identity matrix (as detailed below). Then the matrix \(Q\) is simply \(W_1 W_2 \cdots W_d\) (check it!).
To define these matrices, we proceed greedily, sequentially maximizing the quadratic form \(\langle \mathbf{v}, A \mathbf{v}\rangle\). How is this quadratic form related to eigenvalues? Note that, for a unit eigenvector \(\mathbf{v}\) with eigenvalue \(\lambda\), we have \(\langle \mathbf{v}, A \mathbf{v}\rangle = \langle \mathbf{v}, \lambda \mathbf{v}\rangle = \lambda\).
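Numerically, this relation is easy to observe. The snippet below (a sketch using an arbitrary random symmetric matrix) extracts a unit eigenvector with `np.linalg.eigh` and evaluates the quadratic form:

```python
import numpy as np

# For a unit eigenvector v of a symmetric A with eigenvalue lam,
# the quadratic form <v, A v> equals lam.
rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2                     # a random symmetric matrix
eigvals, eigvecs = np.linalg.eigh(A)  # ascending eigenvalues, unit eigenvectors
v, lam = eigvecs[:, -1], eigvals[-1]  # leading eigenpair
assert np.isclose(v @ A @ v, lam)
```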
Proof: (Spectral Theorem) We proceed by induction. We have already encountered the idea that eigenvectors are solutions of a constrained optimization problem. The proof in fact uses that idea.
A first eigenvector: Let \(A_1 = A\). Define
\[
\mathbf{v}_1 \in \arg\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle : \|\mathbf{v}\| = 1\},
\]
which exists by the Extreme Value Theorem, and further define
\[
\lambda_1 = \max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle : \|\mathbf{v}\| = 1\} = \langle \mathbf{v}_1, A_1 \mathbf{v}_1 \rangle.
\]
Complete \(\mathbf{v}_1\) into an orthonormal basis of \(\mathbb{R}^d\), \(\mathbf{v}_1\), \(\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d\), and form the block matrix
\[
\hat{W}_1 = \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix},
\]
where the columns of \(\hat{V}_1\) are \(\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d\). Note that \(\hat{W}_1\) is orthogonal by construction.
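In code, one convenient way to carry out this completion is with a Householder reflection mapping \(\mathbf{e}_1\) to \(\mathbf{v}_1\), echoing the transformations mentioned in the proof idea. The helper `complete_basis` below is our own illustrative construction (a QR factorization would work just as well):

```python
import numpy as np

def complete_basis(v1):
    """Return an orthonormal basis of R^d whose first column is the unit vector v1.

    Uses the Householder reflection H = I - 2 u u^T / <u, u> with u = v1 - e_1,
    which satisfies H e_1 = v1; H is symmetric and orthogonal, so its columns
    are orthonormal and columns 2..d span span(v1)^perp.
    """
    d = v1.shape[0]
    e1 = np.zeros(d)
    e1[0] = 1.0
    u = v1 - e1
    if np.linalg.norm(u) < 1e-12:  # v1 is (numerically) e_1 already
        return np.eye(d)
    return np.eye(d) - 2.0 * np.outer(u, u) / (u @ u)

# quick check
rng = np.random.default_rng(2)
v1 = rng.normal(size=5)
v1 /= np.linalg.norm(v1)
W_hat = complete_basis(v1)
assert np.allclose(W_hat.T @ W_hat, np.eye(5))  # orthogonal
assert np.allclose(W_hat[:, 0], v1)             # first column is v1
```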
Getting one step closer to diagonalization: We show next that \(\hat{W}_1\) gets us one step closer to a diagonal matrix by a similarity transformation. Note first that
\[
\hat{W}_1^T A_1 \hat{W}_1
= \begin{pmatrix} \mathbf{v}_1^T \\ \hat{V}_1^T \end{pmatrix} A_1 \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix}
= \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix},
\]
where \(\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1\) and \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\).
The key claim is that \(\mathbf{w}_1 = \mathbf{0}\). This follows from a contradiction argument. Indeed, suppose \(\mathbf{w}_1 \neq \mathbf{0}\) and, for \(\delta > 0\), consider the unit vector (check that \(\|\mathbf{z}\| = 1\))
\[
\mathbf{z} = \frac{1}{\sqrt{1 + \delta^2 \|\mathbf{w}_1\|^2}} \, \hat{W}_1 \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix},
\]
which, by the block formula above, achieves the objective value
\[
\langle \mathbf{z}, A_1 \mathbf{z} \rangle
= \frac{\lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \, \mathbf{w}_1^T A_2 \mathbf{w}_1}{1 + \delta^2 \|\mathbf{w}_1\|^2}.
\]
By the sum of a geometric series, for \(\varepsilon \in (0,1)\),
\[
\frac{1}{1 + \varepsilon} = 1 - \varepsilon + \varepsilon^2 - \varepsilon^3 + \cdots.
\]
So, for \(\delta > 0\) small enough,
\[
\langle \mathbf{z}, A_1 \mathbf{z} \rangle \geq \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 - C \delta^2 > \lambda_1,
\]
where \(C \in \mathbb{R}\) depends on \(\mathbf{w}_1\) and \(A_2\), and where we ignored the “higher-order” terms involving \(\delta^3, \delta^4, \delta^5, \ldots\), which are negligible. That gives the desired contradiction: it would imply the existence of a unit vector achieving a strictly better objective value than the optimal \(\mathbf{v}_1\).
So, letting \(W_1 = \hat{W}_1\),
\[
W_1^T A_1 W_1 = \begin{pmatrix} \lambda_1 & \mathbf{0}^T \\ \mathbf{0} & A_2 \end{pmatrix}.
\]
Finally, note that \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\) is symmetric:
\[
A_2^T = (\hat{V}_1^T A_1 \hat{V}_1)^T = \hat{V}_1^T A_1^T \hat{V}_1 = \hat{V}_1^T A_1 \hat{V}_1 = A_2,
\]
by the symmetry of \(A_1\) itself.
Next step of the induction: Apply the same argument to the symmetric matrix \(A_2 \in \mathbb{R}^{(d-1)\times (d-1)}\), let \(\hat{W}_2 = (\mathbf{v}_2\ \hat{V}_2) \in \mathbb{R}^{(d-1)\times (d-1)}\) be the corresponding orthogonal matrix, and define \(\lambda_2\) and \(A_3\) through the equation
\[
\hat{W}_2^T A_2 \hat{W}_2 = \begin{pmatrix} \lambda_2 & \mathbf{0}^T \\ \mathbf{0} & A_3 \end{pmatrix}.
\]
Now define the block matrix
\[
W_2 = \begin{pmatrix} 1 & \mathbf{0}^T \\ \mathbf{0} & \hat{W}_2 \end{pmatrix},
\]
which is orthogonal. Observe that (check it!)
\[
W_2^T W_1^T A_1 W_1 W_2
= \begin{pmatrix} \lambda_1 & 0 & \mathbf{0}^T \\ 0 & \lambda_2 & \mathbf{0}^T \\ \mathbf{0} & \mathbf{0} & A_3 \end{pmatrix}.
\]
Proceeding similarly by induction gives the claim. \(\square\)
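To make the induction concrete, here is a NumPy sketch of the full greedy diagonalization, in which `np.linalg.eigh` applied to each deflated block plays the role of the greedy maximizer (reusing the illustrative `complete_basis` helper from above):

```python
import numpy as np

def greedy_diagonalize(A):
    """Sketch of the proof's induction: returns Q = W_1 ... W_d and the lambdas."""
    d = A.shape[0]
    Q = np.eye(d)
    Ak = A.copy()
    lams = []
    for k in range(d):
        w, V = np.linalg.eigh(Ak)   # leading eigenvector maximizes <v, A_k v>
        v1, lam1 = V[:, -1], w[-1]
        W_hat = complete_basis(v1)  # [v1, V_hat], orthogonal, as in the proof
        Wk = np.eye(d)              # W_k = blockdiag(I_k, W_hat)
        Wk[k:, k:] = W_hat
        Q = Q @ Wk
        lams.append(lam1)
        Ak = (W_hat.T @ Ak @ W_hat)[1:, 1:]  # the deflated block A_{k+1}
    return Q, np.array(lams)

rng = np.random.default_rng(3)
M = rng.normal(size=(6, 6))
A = (M + M.T) / 2
Q, lams = greedy_diagonalize(A)
assert np.allclose(Q.T @ A @ Q, np.diag(lams), atol=1e-8)
assert np.all(np.diff(lams) <= 1e-10)  # greedy order: lambda_1 >= ... >= lambda_d
```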
A couple of remarks about the proof:
1- The fact that \(\mathbf{w}_1 = \mathbf{0}\) is perhaps more intuitively understood through vector calculus: using the method of Lagrange multipliers on \(\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}\) shows that \(A_1 \mathbf{v}_1\) must be proportional to \(\mathbf{v}_1\). Hence \(\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1 = \mathbf{0}\) by construction of \(\hat{V}_1\), whose columns are orthogonal to \(\mathbf{v}_1\).
2- By construction, the vector \(\mathbf{v}_2\) (i.e., the first column of \(\hat{W}_2\)) solves
\[
\mathbf{v}_2 \in \arg\max\{\langle \mathbf{v}, A_2 \mathbf{v}\rangle : \|\mathbf{v}\| = 1\}.
\]
Note that, by definition of \(A_2\) (and the fact that \(A_1 = A\)),
\[
\langle \mathbf{v}, A_2 \mathbf{v} \rangle
= \langle \mathbf{v}, \hat{V}_1^T A \hat{V}_1 \mathbf{v} \rangle
= \langle \hat{V}_1 \mathbf{v}, A \, \hat{V}_1 \mathbf{v} \rangle.
\]
So we can think of the solution \(\mathbf{v}_2\) as specifying an optimal linear combination of the columns of \(\hat{V}_1\) – which form a basis of the space \(\mathrm{span}(\mathbf{v}_1)^\perp\) of vectors orthogonal to \(\mathbf{v}_1\). In essence \(\mathbf{v}_2\) solves the same problem as \(\mathbf{v}_1\), but restricted to \(\mathrm{span}(\mathbf{v}_1)^\perp\). We will come back to this below.
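The following sketch illustrates this remark numerically: lifting the deflated maximizer back as \(\hat{V}_1 \mathbf{v}_2\) produces a unit eigenvector of \(A\) orthogonal to \(\mathbf{v}_1\), with the second-largest eigenvalue (again reusing the illustrative `complete_basis` helper; the random matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)
v1 = V[:, -1]                       # maximizer of <v, A v> over unit vectors
V_hat1 = complete_basis(v1)[:, 1:]  # orthonormal basis of span(v1)^perp
A2 = V_hat1.T @ A @ V_hat1          # the deflated matrix
w2, V2 = np.linalg.eigh(A2)
u = V_hat1 @ V2[:, -1]              # lift the deflated maximizer back to R^d
assert np.isclose(v1 @ u, 0.0, atol=1e-8)         # orthogonal to v1
assert np.allclose(A @ u, w2[-1] * u, atol=1e-8)  # an eigenvector of A
assert np.isclose(w2[-1], w[-2])                  # with the 2nd-largest eigenvalue
```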
4.3.2. Variational characterization: important special cases#
We begin with a definition.
DEFINITION (Rayleigh Quotient) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix. The Rayleigh quotient is defined as
\[
\mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle},
\]
which is defined for any \(\mathbf{u} \neq \mathbf{0}\) in \(\mathbb{R}^{d}\). \(\natural\)
To start seeing the connection to spectral decomposition, let \(\mathbf{v}\) be a (not necessarily unit) eigenvector of \(A\) with eigenvalue \(\lambda\). One can show that \(\mathcal{R}_A(\mathbf{v}) = \lambda\). (Try it!)
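In NumPy, the Rayleigh quotient is a one-liner. The sketch below (with an arbitrary random symmetric matrix) also checks the eigenvector property, including its invariance under rescaling of \(\mathbf{v}\):

```python
import numpy as np

def rayleigh_quotient(A, u):
    """R_A(u) = <u, A u> / <u, u>, defined for u != 0."""
    return (u @ A @ u) / (u @ u)

rng = np.random.default_rng(5)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)
v = 3.7 * V[:, 0]  # an eigenvector, deliberately not of unit norm
assert np.isclose(rayleigh_quotient(A, v), w[0])
```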
Before stating a general variational characterization, we prove a few important special cases. Throughout, let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\).
The largest eigenvalue: Since \(\mathbf{v}_1, \ldots, \mathbf{v}_d\) form an orthonormal basis of \(\mathbb{R}^d\), any nonzero vector \(\mathbf{u}\) can be written as \(\mathbf{u} = \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) and it follows that, by the Properties of Orthonormal Lists and the bilinearity of the inner product,
\[
\mathcal{R}_A(\mathbf{u})
= \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle}
= \frac{\sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}.
\]
Thus,
\[
\mathcal{R}_A(\mathbf{u}) \leq \frac{\lambda_1 \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_1,
\]
where we used \(\lambda_1 \geq \cdots \geq \lambda_d\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_1) = \lambda_1\). So we have established
\[
\lambda_1 = \max_{\mathbf{u} \neq \mathbf{0}} \mathcal{R}_A(\mathbf{u}).
\]
The smallest eigenvalue: Arguing in the opposite direction, we get a characterization of the smallest eigenvalue. Using the same notation as before, we have
\[
\mathcal{R}_A(\mathbf{u}) \geq \frac{\lambda_d \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_d,
\]
where, again, we used \(\lambda_1 \geq \cdots \geq \lambda_d\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_d) = \lambda_d\). So we have established
\[
\lambda_d = \min_{\mathbf{u} \neq \mathbf{0}} \mathcal{R}_A(\mathbf{u}).
\]
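Both characterizations lend themselves to a simple Monte Carlo sanity check (a sketch; the sample size and random matrix are arbitrary, and `rayleigh_quotient` is the helper defined above):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(6, 6))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)  # ascending: w[0] = lambda_d, w[-1] = lambda_1
samples = [rayleigh_quotient(A, rng.normal(size=6)) for _ in range(10_000)]
# every Rayleigh quotient lies in [lambda_d, lambda_1] ...
assert w[0] - 1e-9 <= min(samples) and max(samples) <= w[-1] + 1e-9
# ... and the bounds are attained at the corresponding eigenvectors
assert np.isclose(rayleigh_quotient(A, V[:, -1]), w[-1])
assert np.isclose(rayleigh_quotient(A, V[:, 0]), w[0])
```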
The second smallest eigenvalue: To pick out the second smallest eigenvalue, we argue as above but restrict the optimization to the space \(\mathcal{V}_{d-1} = \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1})\). Indeed, if \(\mathbf{u}\) is in the linear subspace \(\mathcal{V}_{d-1}\), it can be written as \(\mathbf{u} = \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) (since \(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1}\) forms an orthonormal basis of it; why?) and it follows that
\[
\mathcal{R}_A(\mathbf{u}) = \frac{\sum_{i=1}^{d-1} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}.
\]
Thus,
\[
\mathcal{R}_A(\mathbf{u}) \geq \frac{\lambda_{d-1} \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_{d-1},
\]
where we used \(\lambda_1 \geq \cdots \geq \lambda_{d-1}\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_{d-1}) = \lambda_{d-1}\) and of course \(\mathbf{v}_{d-1} \in \mathcal{V}_{d-1}\). So we have established
\[
\lambda_{d-1} = \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_{d-1}} \mathcal{R}_A(\mathbf{u}).
\]
Now what is \(\mathcal{V}_{d-1}\)? It is spanned by the orthonormal list \(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1}\), each of which is orthogonal to \(\mathbf{v}_d\). So (Why?)
\[
\mathcal{V}_{d-1} = \{\mathbf{u} \in \mathbb{R}^d : \langle \mathbf{u}, \mathbf{v}_d \rangle = 0\}.
\]
So, equivalently,
\[
\lambda_{d-1} = \min\{\mathcal{R}_A(\mathbf{u}) : \mathbf{u} \neq \mathbf{0},\ \langle \mathbf{u}, \mathbf{v}_d \rangle = 0\}.
\]
In fact, for any \(\mathbf{u} \neq \mathbf{0}\), we can normalize it by defining \(\mathbf{z} = \mathbf{u}/\|\mathbf{u}\|\) and we note that
\[
\mathcal{R}_A(\mathbf{z})
= \frac{\langle \mathbf{u}, A \mathbf{u} \rangle / \|\mathbf{u}\|^2}{\langle \mathbf{u}, \mathbf{u} \rangle / \|\mathbf{u}\|^2}
= \mathcal{R}_A(\mathbf{u}),
\qquad \|\mathbf{z}\| = 1.
\]
So,
\[
\lambda_{d-1} = \min\{\langle \mathbf{z}, A \mathbf{z} \rangle : \|\mathbf{z}\| = 1,\ \langle \mathbf{z}, \mathbf{v}_d \rangle = 0\}.
\]
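Numerically, the constraint is easy to enforce by projecting out the \(\mathbf{v}_d\) component of random directions (a sketch reusing the `rayleigh_quotient` helper):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.normal(size=(6, 6))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)
v_d = V[:, 0]  # eigenvector of the smallest eigenvalue
quotients = []
for _ in range(10_000):
    u = rng.normal(size=6)
    u -= (u @ v_d) * v_d  # enforce <u, v_d> = 0
    quotients.append(rayleigh_quotient(A, u))
assert w[1] - 1e-9 <= min(quotients)                    # lambda_{d-1} is a lower bound
assert np.isclose(rayleigh_quotient(A, V[:, 1]), w[1])  # attained at v_{d-1}
```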
4.3.3. General statement: Courant-Fischer#
Before stating a general result, we give one more example.
The second smallest eigenvalue (take two): Interestingly, there is a second characterization of the second smallest eigenvalue. Indeed, suppose we restrict the optimization to the space \(\mathcal{W}_{2} = \mathrm{span}(\mathbf{v}_{d-1},\mathbf{v}_d)\) instead. If \(\mathbf{u}\) is in the linear subspace \(\mathcal{W}_{2}\), it can be written as \(\mathbf{u} = \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) (since \(\mathbf{v}_{d-1},\mathbf{v}_{d}\) forms an orthonormal basis of it) and it follows that
\[
\mathcal{R}_A(\mathbf{u}) = \frac{\sum_{i=d-1}^{d} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}.
\]
Thus,
\[
\mathcal{R}_A(\mathbf{u}) \leq \frac{\lambda_{d-1} \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_{d-1},
\]
where we used \(\lambda_{d-1} \geq \lambda_{d}\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_{d-1}) = \lambda_{d-1}\). So we have established
\[
\lambda_{d-1} = \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}_{2}} \mathcal{R}_A(\mathbf{u}).
\]
THEOREM (Courant-Fischer) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\). For each \(k = 1,\ldots,d\), define the subspaces
\[
\mathcal{V}_k = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k),
\qquad
\mathcal{W}_{d-k+1} = \mathrm{span}(\mathbf{v}_k, \ldots, \mathbf{v}_d).
\]
Then, for all \(k = 1,\ldots,d\),
\[
\lambda_k
= \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u})
= \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}),
\]
which are referred to as the local formulas. Furthermore, we have the following min-max (or global) formulas, which do not depend on the choice of spectral decomposition: for all \(k = 1,\ldots,d\),
\[
\lambda_k
= \max_{\mathrm{dim}(\mathcal{V}) = k}\ \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u})
= \min_{\mathrm{dim}(\mathcal{W}) = d-k+1}\ \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}} \mathcal{R}_A(\mathbf{u}).
\]
\(\sharp\)
Proof idea: For the local formula, we expand a vector in \(\mathcal{V}_k\) into the basis \(\mathbf{v}_1,\ldots,\mathbf{v}_k\) and use the fact that \(\mathcal{R}_A(\mathbf{v}_i) = \lambda_i\) and that eigenvalues are in non-increasing order. The global formulas follow from a dimension argument.
The full proof is given in the next (optional) subsection.
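Before moving on, the local formulas can be checked numerically for every \(k\): restricted to \(\mathcal{V}_k\) (respectively \(\mathcal{W}_{d-k+1}\)), the Rayleigh quotient of \(A\) becomes the Rayleigh quotient of the compressed matrix \(V_k^T A V_k\) (respectively \(W^T A W\)), whose smallest (respectively largest) eigenvalue is \(\lambda_k\). A sketch:

```python
import numpy as np

rng = np.random.default_rng(8)
d = 6
M = rng.normal(size=(d, d))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)
lams, Vdesc = w[::-1], V[:, ::-1]  # descending order: lams[0] = lambda_1
for k in range(1, d + 1):
    Vk = Vdesc[:, :k]      # orthonormal basis of V_k = span(v_1, ..., v_k)
    Wk = Vdesc[:, k - 1:]  # orthonormal basis of W_{d-k+1} = span(v_k, ..., v_d)
    # min over V_k = smallest eigenvalue of the compression V_k^T A V_k = lambda_k
    assert np.isclose(np.linalg.eigvalsh(Vk.T @ A @ Vk)[0], lams[k - 1])
    # max over W_{d-k+1} = largest eigenvalue of its compression = lambda_k
    assert np.isclose(np.linalg.eigvalsh(Wk.T @ A @ Wk)[-1], lams[k - 1])
```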
Courant-Fischer can be used to recover the special cases of the previous subsection. One also gets new cases of interest. We give a characterization of the third smallest eigenvalue next.
EXAMPLE: (Third Smallest Eigenvalue) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\). Using \(k=d-2\) in Courant-Fischer gives
\[
\lambda_{d-2} = \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_{d-2}} \mathcal{R}_A(\mathbf{u}),
\]
where
\[
\mathcal{V}_{d-2} = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{d-2}).
\]
It can be seen that
\[
\mathcal{V}_{d-2} = \{\mathbf{u} \in \mathbb{R}^d : \langle \mathbf{u}, \mathbf{v}_{d-1} \rangle = \langle \mathbf{u}, \mathbf{v}_d \rangle = 0\}.
\]
So we finally arrive at the following characterization of the third smallest eigenvalue:
\[
\lambda_{d-2} = \min\{\mathcal{R}_A(\mathbf{u}) : \mathbf{u} \neq \mathbf{0},\ \langle \mathbf{u}, \mathbf{v}_{d-1} \rangle = \langle \mathbf{u}, \mathbf{v}_d \rangle = 0\}.
\]
Using the same argument as we used for \(\lambda_{d-1}\), we also get
\[
\lambda_{d-2} = \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}_{3}} \mathcal{R}_A(\mathbf{u}),
\qquad \text{where } \mathcal{W}_{3} = \mathrm{span}(\mathbf{v}_{d-2}, \mathbf{v}_{d-1}, \mathbf{v}_d).
\]
\(\lhd\)
We now give a proof of Courant-Fischer. The local formulas follow from the same argument used to derive the special cases above. The global formulas require a new idea.
Proof: We first prove the local formulas, that is, the ones involving a specific spectral decomposition.
Local formulas: Since \(\mathbf{v}_1, \ldots, \mathbf{v}_k\) form an orthonormal basis of \(\mathcal{V}_k\), any nonzero vector \(\mathbf{u} \in \mathcal{V}_k\) can be written as \(\mathbf{u} = \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) and it follows that
\[
\mathcal{R}_A(\mathbf{u}) = \frac{\sum_{i=1}^k \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2}.
\]
Thus,
\[
\mathcal{R}_A(\mathbf{u}) \geq \frac{\lambda_k \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_k,
\]
where we used \(\lambda_1 \geq \cdots \geq \lambda_k\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_k) = \lambda_k\). So we have established
\[
\lambda_k = \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}).
\]
The expression in terms of \(\mathcal{W}_{d-k+1}\) is proved similarly.
Global formulas: Since \(\mathcal{V}_k\) has dimension \(k\), it follows from the local formula that
\[
\lambda_k = \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u})
\leq \max_{\mathrm{dim}(\mathcal{V}) = k}\ \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}).
\]
Let \(\mathcal{V}\) be any subspace of dimension \(k\). Because \(\mathcal{W}_{d-k+1}\) has dimension \(d - k + 1\), we have \(\mathrm{dim}(\mathcal{V}) + \mathrm{dim}(\mathcal{W}_{d-k+1}) > d\), so there must be a nonzero vector \(\mathbf{u}_0\) in the intersection \(\mathcal{V} \cap \mathcal{W}_{d-k+1}\) (Prove it!).
We then have by the other local formula that
\[
\min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u})
\leq \mathcal{R}_A(\mathbf{u}_0)
\leq \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}) = \lambda_k.
\]
Since this inequality holds for any subspace of dimension \(k\), we have
\[
\max_{\mathrm{dim}(\mathcal{V}) = k}\ \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}) \leq \lambda_k.
\]
Combining with the inequality in the other direction above gives the claim. The other global formula is proved similarly. \(\square\)
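To see the dimension argument in action, the sketch below draws a random \(k\)-dimensional subspace \(\mathcal{V}\), extracts a nonzero vector of \(\mathcal{V} \cap \mathcal{W}_{d-k+1}\) from the null space of a stacked basis matrix (one illustrative construction among many), and confirms the resulting bound:

```python
import numpy as np

rng = np.random.default_rng(9)
d, k = 6, 3
M = rng.normal(size=(d, d))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)
lams, Vdesc = w[::-1], V[:, ::-1]                   # descending eigenpairs
V_basis = np.linalg.qr(rng.normal(size=(d, k)))[0]  # random k-dim subspace V
W_basis = Vdesc[:, k - 1:]                          # basis of W_{d-k+1}
# [V_basis, -W_basis] is d x (d+1), so it has a nontrivial null space:
B = np.hstack([V_basis, -W_basis])
c = np.linalg.svd(B)[2][-1]       # null vector: V_basis c[:k] = W_basis c[k:]
u0 = V_basis @ c[:k]              # lies in both V and W_{d-k+1}
assert np.linalg.norm(u0) > 1e-8  # and is nonzero
q = (u0 @ A @ u0) / (u0 @ u0)
# hence min over V <= R_A(u0) <= max over W_{d-k+1} = lambda_k
assert q <= lams[k - 1] + 1e-8
```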