4.3. Variational characterization of eigenvalues#

In this section, we characterize eigenvalues in terms of certain optimization problems. Such variational characterizations are very useful in applications, as we will see later in the chapter. We first give a proof of the spectral theorem where this idea is used explicitly.

4.3.1. Proof of Spectral Theorem#

We will need the following block matrix formula. Suppose

\[\begin{split} A = \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} B_{11} & B_{12} \end{pmatrix} \end{split}\]

where \(A_{11} \in \mathbb{R}^{n_1\times p}\), \(A_{21} \in \mathbb{R}^{n_2\times p}\), \(B_{11} \in \mathbb{R}^{p\times n_1}\), \(B_{12} \in \mathbb{R}^{p\times n_2}\), then

\[\begin{split} A B = \begin{pmatrix} A_{11} B_{11} & A_{11} B_{12} \\ A_{21} B_{11} & A_{21} B_{12} \end{pmatrix}. \end{split}\]

Indeed this is just a special case of the \(2 \times 2\) block matrix product formula we have encountered previously, with empty blocks \(A_{12}, A_{22}, B_{21}, B_{22}\).
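
As a quick sanity check, here is a minimal NumPy sketch (the block sizes and entries are arbitrary choices) verifying the block product formula numerically.

```python
import numpy as np

# Check the block product formula on random blocks of arbitrary sizes.
rng = np.random.default_rng(0)
n1, n2, p = 2, 3, 4
A11 = rng.standard_normal((n1, p))
A21 = rng.standard_normal((n2, p))
B11 = rng.standard_normal((p, n1))
B12 = rng.standard_normal((p, n2))

A = np.vstack([A11, A21])              # (n1 + n2) x p
B = np.hstack([B11, B12])              # p x (n1 + n2)
blockwise = np.block([[A11 @ B11, A11 @ B12],
                      [A21 @ B11, A21 @ B12]])
print(np.allclose(A @ B, blockwise))   # True
```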

Proof idea (Spectral Theorem): Similarly to how we used Householder transformations to “add zeros under the diagonal”, here we will use a sequence of orthogonal transformations to add zeros both below and above the diagonal. Specifically, we construct a sequence of orthogonal matrices \(W_1,\ldots, W_d\) such that

\[ \Lambda = W_d^T \cdots W_2^T W_1^T A W_1 W_2 \cdots W_d \]

is diagonal. Then the matrix \(Q\) is simply \(W_1 W_2 \cdots W_d\) (check it!).

To define these matrices, we proceed greedily, at each step maximizing the quadratic form \(\langle \mathbf{v}, A \mathbf{v}\rangle\) over unit vectors. How is this quadratic form related to eigenvalues? Note that, for a unit eigenvector \(\mathbf{v}\) with eigenvalue \(\lambda\), we have \(\langle \mathbf{v}, A \mathbf{v}\rangle = \langle \mathbf{v}, \lambda \mathbf{v}\rangle = \lambda\).

Proof: (Spectral Theorem) We proceed by induction. We have already encountered the idea that eigenvectors are solutions of a constrained optimization problem; the proof makes explicit use of that idea.

A first eigenvector: Let \(A_1 = A\). Define

\[ \mathbf{v}_1 \in \arg\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\} \]

which exists by the Extreme Value Theorem, and further define

\[ \lambda_1 = \max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}. \]

Complete \(\mathbf{v}_1\) into an orthonormal basis of \(\mathbb{R}^d\), \(\mathbf{v}_1\), \(\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d\), and form the block matrix

\[ \hat{W}_1 = \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix} \]

where the columns of \(\hat{V}_1\) are \(\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d\). Note that \(\hat{W}_1\) is orthogonal by construction.

Getting one step closer to diagonalization: We show next that \(\hat{W}_1\) gets us one step closer to a diagonal matrix by similarity transformation. Note first that

\[\begin{align*} \hat{W}_1^T A_1 \hat{W}_1 &= \begin{pmatrix} \mathbf{v}_1^T \\ \hat{V}_1^T \end{pmatrix} A_1 \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix}\\ &= \begin{pmatrix} \mathbf{v}_1^T \\ \hat{V}_1^T \end{pmatrix} \begin{pmatrix} A_1 \mathbf{v}_1 & A_1 \hat{V}_1 \end{pmatrix}\\ &= \begin{pmatrix} \mathbf{v}_1^T A_1 \mathbf{v}_1 & \mathbf{v}_1^T A_1 \hat{V}_1\\ \hat{V}_1^T A_1 \mathbf{v}_1 & \hat{V}_1^T A_1 \hat{V}_1 \end{pmatrix}\\ &= \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix} \end{align*}\]

where \(\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1\) and \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\).

The key claim is that \(\mathbf{w}_1 = \mathbf{0}\). We argue by contradiction. Suppose \(\mathbf{w}_1 \neq \mathbf{0}\) and, for \(\delta > 0\) small (to be specified below), consider the unit vector (check that \(\|\mathbf{z}\| = 1\))

\[\begin{split} \mathbf{z} = \hat{W}_1 \frac{1}{\sqrt{1 + \delta^2 \|\mathbf{w}_1\|^2}} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix} \end{split}\]

which, by the block formula above, achieves the objective value

\[\begin{align*} \mathbf{z}^T A_1 \mathbf{z} &= \frac{1}{1 + \delta^2 \|\mathbf{w}_1\|^2} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix}^T \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix}\\ &= \frac{1}{1 + \delta^2 \|\mathbf{w}_1\|^2} \left( \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \mathbf{w}_1^T A_2 \mathbf{w}_1 \right). \end{align*}\]

By the sum of a geometric series, for \(\varepsilon \in (0,1)\),

\[ \frac{1}{1 + \varepsilon^2} = 1 - \varepsilon^2 + \varepsilon^4 - \varepsilon^6 + \cdots. \]

So, for \(\delta > 0\) small enough,

\[\begin{align*} \mathbf{z}^T A_1 \mathbf{z} &\approx (\lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \mathbf{w}_1^T A_2 \mathbf{w}_1) (1 - \delta^2 \|\mathbf{w}_1\|^2)\\ &\approx \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + C \delta^2\\ &> \lambda_1 \end{align*}\]

where \(C \in \mathbb{R}\) depends on \(\mathbf{w}_1\) and \(A_2\), and where we ignored the “higher-order” terms involving \(\delta^3, \delta^4, \delta^5, \ldots\), which are negligible when \(\delta\) is small. The last inequality holds for \(\delta > 0\) small enough because the positive linear term \(2 \delta \|\mathbf{w}_1\|^2\) dominates the quadratic one. That gives the desired contradiction: it would imply the existence of a unit vector achieving a strictly better objective value than the optimal \(\mathbf{v}_1\).

So, letting \(W_1 = \hat{W}_1\),

\[\begin{split} W_1^T A_1 W_1 = \begin{pmatrix} \lambda_1 & \mathbf{0} \\ \mathbf{0} & A_2 \end{pmatrix}. \end{split}\]

Finally, note that \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\) is symmetric:

\[ A_2^T = (\hat{V}_1^T A_1 \hat{V}_1)^T = \hat{V}_1^T A_1^T \hat{V}_1 = \hat{V}_1^T A_1 \hat{V}_1 = A_2 \]

by the symmetry of \(A_1\) itself.

Next step of the induction: Apply the same argument to the symmetric matrix \(A_2 \in \mathbb{R}^{(d-1)\times (d-1)}\), let \(\hat{W}_2 = (\mathbf{v}_2\ \hat{V}_2) \in \mathbb{R}^{(d-1)\times (d-1)}\) be the corresponding orthogonal matrix, and define \(\lambda_2\) and \(A_3\) through the equation

\[\begin{split} \hat{W}_2^T A_2 \hat{W}_2 = \begin{pmatrix} \lambda_2 & \mathbf{0} \\ \mathbf{0} & A_3 \end{pmatrix}. \end{split}\]

Now define the block matrix

\[\begin{split} W_2 = \begin{pmatrix} 1 & \mathbf{0}\\ \mathbf{0} & \hat{W}_2 \end{pmatrix}. \end{split}\]

Observe that (check it!)

\[\begin{split} W_2^T W_1^T A_1 W_1 W_2 = W_2^T \begin{pmatrix} \lambda_1 & \mathbf{0} \\ \mathbf{0} & A_2 \end{pmatrix} W_2 = \begin{pmatrix} \lambda_1 & \mathbf{0}\\ \mathbf{0} & \hat{W}_2^T A_2 \hat{W}_2 \end{pmatrix} =\begin{pmatrix} \lambda_1 & 0 & \mathbf{0} \\ 0 & \lambda_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & A_3 \end{pmatrix}. \end{split}\]

Proceeding similarly by induction gives the claim. \(\square\)

A couple of remarks about the proof:

1- The fact that \(\mathbf{w}_1 = \mathbf{0}\) is perhaps more intuitively understood through vector calculus: applying the method of Lagrange multipliers to \(\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}\) shows that \(A_1 \mathbf{v}_1\) must be proportional to \(\mathbf{v}_1\). Hence \(\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1 = \mathbf{0}\), since the columns of \(\hat{V}_1\) are orthogonal to \(\mathbf{v}_1\) by construction.

2- By construction, the vector \(\mathbf{v}_2\) (i.e., the first column of \(\hat{W}_2\)) is a solution to

\[ \mathbf{v}_2 \in \arg\max\{\langle \mathbf{v}, A_2 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}. \]

Note that, by definition of \(A_2\) (and the fact that \(A_1 = A\)),

\[ \mathbf{v}^T A_2 \mathbf{v} = \mathbf{v}^T \hat{V}_1^T A_1 \hat{V}_1 \mathbf{v} = (\hat{V}_1 \mathbf{v})^T \,A \,(\hat{V}_1 \mathbf{v}). \]

So we can think of the solution \(\mathbf{v}_2\) as specifying an optimal linear combination of the columns of \(\hat{V}_1\) – which form a basis of the space \(\mathrm{span}(\mathbf{v}_1)^\perp\) of vectors orthogonal to \(\mathbf{v}_1\). In essence \(\mathbf{v}_2\) solves the same problem as \(\mathbf{v}_1\), but restricted to \(\mathrm{span}(\mathbf{v}_1)^\perp\). We will come back to this below.
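
The proof is constructive, and its recursion can be mimicked numerically. The sketch below is only an illustration, not a recommended algorithm: it approximates a leading eigenvector by power iteration on a shifted matrix (the shift, iteration count, random completions, and test matrix are ad hoc choices), completes it into an orthonormal basis via QR, and recurses on the block \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\).

```python
import numpy as np

def top_eigenpair(A, iters=1000, seed=0):
    """Approximate (lambda_1, v_1) for symmetric A by power iteration on
    A + shift*I; the shift makes all eigenvalues nonnegative, so the
    largest eigenvalue of A becomes the dominant one in magnitude."""
    d = A.shape[0]
    shift = np.linalg.norm(A, 'fro')          # crude upper bound on |lambda_i|
    B = A + shift * np.eye(d)
    v = np.random.default_rng(seed).standard_normal(d)
    for _ in range(iters):
        v = B @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v

def greedy_diagonalize(A):
    """Greedy construction from the proof: peel off one eigenpair, then
    recurse on A_2 = V1^T A V1, which acts on span(v_1)-perp."""
    d = A.shape[0]
    if d == 1:
        return A.copy(), np.eye(1)
    lam1, v1 = top_eigenpair(A)
    # Complete v1 into an orthonormal basis; columns 2..d span span(v1)-perp.
    M = np.column_stack([v1, np.random.default_rng(1).standard_normal((d, d - 1))])
    W_hat, _ = np.linalg.qr(M)
    V1 = W_hat[:, 1:]
    Lam2, Q2 = greedy_diagonalize(V1.T @ A @ V1)
    Lam = np.zeros((d, d))
    Lam[0, 0], Lam[1:, 1:] = lam1, Lam2
    Q = np.column_stack([W_hat[:, [0]], V1 @ Q2])   # Q = W_hat @ blockdiag(1, Q2)
    return Lam, Q

rng = np.random.default_rng(42)
G = rng.standard_normal((5, 5))
A = (G + G.T) / 2                                   # a symmetric test matrix
Lam, Q = greedy_diagonalize(A)
print(np.round(np.diag(Lam), 6))                    # eigenvalues, non-increasing
print(np.round(np.linalg.eigvalsh(A)[::-1], 6))     # reference values
print(np.abs(Q @ Lam @ Q.T - A).max())              # reconstruction error (small)
```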

4.3.2. Variational characterization: important special cases#

We begin with a definition.

DEFINITION (Rayleigh Quotient) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix. The Rayleigh quotient is defined as

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} \]

which is defined for any \(\mathbf{u} \neq \mathbf{0}\) in \(\mathbb{R}^{d}\). \(\natural\)

To start seeing the connection to spectral decomposition, let \(\mathbf{v}\) be a (not necessarily unit) eigenvector of \(A\) with eigenvalue \(\lambda\). One can show that \(\mathcal{R}_A(\mathbf{v}) = \lambda\). (Try it!)
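
Here is a minimal NumPy check of that fact on an arbitrary symmetric test matrix; `rayleigh_quotient` is a small helper defined here, not a library routine.

```python
import numpy as np

def rayleigh_quotient(A, u):
    """R_A(u) = <u, A u> / <u, u>, defined for u != 0."""
    return (u @ A @ u) / (u @ u)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
A = (G + G.T) / 2                         # symmetric test matrix
lams, V = np.linalg.eigh(A)               # eigenvalues in increasing order
u = 3.7 * V[:, -1]                        # a (non-unit) eigenvector for the largest eigenvalue
print(rayleigh_quotient(A, u), lams[-1])  # the two numbers agree
```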

Before stating a general variational characterization, we prove a few important special cases. Throughout, let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\).

The largest eigenvalue: Since \(\mathbf{v}_1, \ldots, \mathbf{v}_d\) forms an orthonormal basis of \(\mathbb{R}^d\), any nonzero vector \(\mathbf{u}\) can be written as \(\mathbf{u} = \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) and it follows that, by the Properties of Orthonormal Lists and the bilinearity of the inner product,

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \|\mathbf{u}\|^2 = \left\|\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\right\|^2 = \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle A \mathbf{v}_i \right\rangle = \left\langle \mathbf{u}, \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \leq \lambda_1 \frac{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_1, \]

where we used \(\lambda_1 \geq \cdots \geq \lambda_d\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_1) = \lambda_1\). So we have established

\[ \lambda_1 = \max_{\mathbf{u} \neq \mathbf{0}} \mathcal{R}_A(\mathbf{u}). \]

The smallest eigenvalue: Arguing in the opposite direction, we get a characterization of the smallest eigenvalue. Using the same notation as before, we have

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \geq \lambda_d \frac{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_d, \]

where, again, we used \(\lambda_1 \geq \cdots \geq \lambda_d\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_d) = \lambda_d\). So we have established

\[ \lambda_d = \min_{\mathbf{u} \neq \mathbf{0}} \mathcal{R}_A(\mathbf{u}). \]
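
Both extremal characterizations are easy to probe numerically. In the sketch below (arbitrary test matrix and sample size), every random direction yields a Rayleigh quotient between \(\lambda_d\) and \(\lambda_1\), and the extremes are attained at \(\mathbf{v}_d\) and \(\mathbf{v}_1\).

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
A = (G + G.T) / 2
lams, V = np.linalg.eigh(A)              # increasing: lams[0] = lambda_d, lams[-1] = lambda_1

U = rng.standard_normal((6, 100_000))    # many random nonzero vectors (as columns)
R = np.sum(U * (A @ U), axis=0) / np.sum(U * U, axis=0)   # their Rayleigh quotients
print(lams[0] <= R.min(), R.max() <= lams[-1])            # True True
print(V[:, -1] @ A @ V[:, -1], lams[-1])                  # max is attained at v_1
print(V[:, 0] @ A @ V[:, 0], lams[0])                     # min is attained at v_d
```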

The second smallest eigenvalue: To pick out the second smallest eigenvalue, we argue as above but restrict the optimization to the space \(\mathcal{V}_{d-1} = \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1})\). Indeed, if \(\mathbf{u}\) is in the linear subspace \(\mathcal{V}_{d-1}\), it can be written as \(\mathbf{u} = \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) (since \(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1}\) forms an orthonormal basis of it; why?) and it follows that

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=1}^{d-1} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^{d-1} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \geq \lambda_{d-1} \frac{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_{d-1}, \]

where we used \(\lambda_1 \geq \cdots \geq \lambda_{d-1}\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_{d-1}) = \lambda_{d-1}\) and of course \(\mathbf{v}_{d-1} \in \mathcal{V}_{d-1}\). So we have established

\[ \lambda_{d-1} = \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_{d-1}} \mathcal{R}_A(\mathbf{u}). \]

Now what is \(\mathcal{V}_{d-1}\)? It is spanned by the orthonormal list \(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1}\), each of which is orthogonal to \(\mathbf{v}_d\). So (Why?)

\[ \mathcal{V}_{d-1} = \mathrm{span}(\mathbf{v}_d)^\perp. \]

So, equivalently,

\[ \lambda_{d-1} = \min\left\{\mathcal{R}_A(\mathbf{u})\,:\ \mathbf{u} \neq \mathbf{0}, \langle \mathbf{u}, \mathbf{v}_d\rangle = 0 \right\}. \]

In fact, for any \(\mathbf{u} \neq \mathbf{0}\), we can normalize it by defining \(\mathbf{z} = \mathbf{u}/\|\mathbf{u}\|\) and we note that

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u}\rangle}{\langle \mathbf{u},\mathbf{u}\rangle} = \frac{\langle \mathbf{u}, A \mathbf{u}\rangle}{\|\mathbf{u}\|^2} = \left\langle \frac{\mathbf{u}}{\|\mathbf{u}\|}, A \frac{\mathbf{u}}{\|\mathbf{u}\|}\right\rangle = \langle \mathbf{z}, A \mathbf{z}\rangle. \]

So,

\[ \lambda_{d-1} = \min\left\{\langle \mathbf{z}, A \mathbf{z}\rangle\,:\ \|\mathbf{z}\|=1, \langle \mathbf{z}, \mathbf{v}_d\rangle = 0 \right\}. \]
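
One way to check this numerically is to absorb the constraint: if the columns of a matrix \(W\) form an orthonormal basis of \(\mathrm{span}(\mathbf{v}_d)^\perp\), then writing \(\mathbf{z} = W\mathbf{y}\) with \(\|\mathbf{y}\| = 1\) turns the constrained problem into computing the smallest eigenvalue of \(W^T A W\). A minimal sketch (arbitrary test matrix; the QR completion is one convenient way to build \(W\)):

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((5, 5))
A = (G + G.T) / 2
lams, V = np.linalg.eigh(A)          # increasing: lams[0] = lambda_d, lams[1] = lambda_{d-1}
v_d = V[:, 0]                        # unit eigenvector of the smallest eigenvalue

# Orthonormal basis W of span(v_d)-perp, obtained by completing v_d via QR
M = np.column_stack([v_d, rng.standard_normal((5, 4))])
Q, _ = np.linalg.qr(M)
W = Q[:, 1:]

# With z = W y and ||y|| = 1, the constrained minimum is the smallest
# eigenvalue of the restricted matrix W^T A W
print(np.linalg.eigvalsh(W.T @ A @ W)[0], lams[1])   # both equal lambda_{d-1}
```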

4.3.3. General statement: Courant-Fischer#

Before stating a general result, we give one more example.

The second smallest eigenvalue (take two): Interestingly, there is a second characterization of the second smallest eigenvalue. Indeed, suppose we restrict the optimization to the space \(\mathcal{W}_{2} = \mathrm{span}(\mathbf{v}_{d-1},\mathbf{v}_d)\) instead. If \(\mathbf{u}\) is in the linear subspace \(\mathcal{W}_{2}\), it can be written as \(\mathbf{u} = \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) (since \(\mathbf{v}_{d-1},\mathbf{v}_{d}\) forms an orthonormal basis of it) and it follows that

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=d-1}^{d} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=d-1}^{d} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \leq \lambda_{d-1} \frac{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_{d-1}, \]

where we used \(\lambda_{d-1} \geq \lambda_{d}\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_{d-1}) = \lambda_{d-1}\). So we have established

\[ \lambda_{d-1} = \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}_{2}} \mathcal{R}_A(\mathbf{u}). \]
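
Restricting to \(\mathcal{W}_2\) can be checked the same way: with an orthonormal basis of \(\mathrm{span}(\mathbf{v}_{d-1}, \mathbf{v}_d)\) as columns, the maximum of the Rayleigh quotient over \(\mathcal{W}_2\) is the largest eigenvalue of the \(2 \times 2\) restriction. A minimal sketch on an arbitrary test matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((5, 5))
A = (G + G.T) / 2
lams, V = np.linalg.eigh(A)          # increasing: lams[0] = lambda_d, lams[1] = lambda_{d-1}

W2 = V[:, :2]                        # orthonormal basis of W_2 = span(v_{d-1}, v_d)
# Maximizing R_A over W_2 is the largest eigenvalue of the 2x2 restriction W_2^T A W_2
print(np.linalg.eigvalsh(W2.T @ A @ W2)[-1], lams[1])   # both equal lambda_{d-1}
```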

THEOREM (Courant-Fischer) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\). For each \(k = 1,\ldots,d\), define the subspaces

\[ \mathcal{V}_k = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) \quad\text{and}\quad \mathcal{W}_{d-k+1} = \mathrm{span}(\mathbf{v}_k, \ldots, \mathbf{v}_d). \]

Then, for all \(k = 1,\ldots,d\),

\[ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}) = \max_{\mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}), \]

which are referred to as the local formulas. Furthermore, we have the following min-max (or global) formulas, which do not depend on the choice of spectral decomposition: for all \(k = 1,\ldots,d\),

\[ \lambda_k = \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}) = \min_{\mathrm{dim}(\mathcal{W}) = d-k+1} \max_{\mathbf{u} \in \mathcal{W}} \mathcal{R}_A(\mathbf{u}). \]

\(\sharp\)

Proof idea: For the local formulas, we expand a vector in \(\mathcal{V}_k\) into the basis \(\mathbf{v}_1,\ldots,\mathbf{v}_k\) and use the fact that \(\mathcal{R}_A(\mathbf{v}_i) = \lambda_i\) and that the eigenvalues are in non-increasing order. The global formulas follow from a dimension argument.

The full proof is given in the next (optional) subsection.
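
The local formulas are easy to verify numerically for every \(k\) at once: restrict \(A\) to \(\mathcal{V}_k\) (respectively \(\mathcal{W}_{d-k+1}\)) and read off the smallest (respectively largest) eigenvalue of the restriction. A minimal sketch on an arbitrary symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 6))
A = (G + G.T) / 2
lams_inc, V_inc = np.linalg.eigh(A)
lams, V = lams_inc[::-1], V_inc[:, ::-1]     # reorder so lambda_1 >= ... >= lambda_d

d = A.shape[0]
for k in range(1, d + 1):
    Vk = V[:, :k]                            # orthonormal basis of V_k = span(v_1, ..., v_k)
    Wk = V[:, k - 1:]                        # orthonormal basis of W_{d-k+1} = span(v_k, ..., v_d)
    local_min = np.linalg.eigvalsh(Vk.T @ A @ Vk)[0]     # min of R_A over V_k
    local_max = np.linalg.eigvalsh(Wk.T @ A @ Wk)[-1]    # max of R_A over W_{d-k+1}
    print(k, np.allclose([local_min, local_max], lams[k - 1]))   # True for every k
```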

Courant-Fischer can be used to recover the special cases of the previous subsection. One also gets new cases of interest. We give a characterization of the third smallest eigenvalue next.

EXAMPLE: (Third Smallest Eigenvalue) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\). Using \(k=d-2\) in Courant-Fischer gives

\[ \lambda_{d-2} = \min_{\mathbf{u} \in \mathcal{V}_{d-2}} \mathcal{R}_A(\mathbf{u}), \]

where

\[ \mathcal{V}_{d-2} = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{d-2}). \]

It can be seen that

\[ \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{d-2}) = \mathrm{span}(\mathbf{v}_{d-1}, \mathbf{v}_{d})^\perp. \]

So we finally arrive at the following characterization of the third smallest eigenvalue

\[ \lambda_{d-2} = \min\{\mathcal{R}_A(\mathbf{u})\,:\, \mathbf{0} \neq \mathbf{u}, \langle \mathbf{u},\mathbf{v}_d\rangle = 0, \langle \mathbf{u},\mathbf{v}_{d-1}\rangle = 0\}. \]

Using the same argument as we used for \(\lambda_{d-1}\), we get also

\[ \lambda_{d-2} = \min\left\{\langle \mathbf{z}, A \mathbf{z}\rangle\,:\ \|\mathbf{z}\|=1, \langle \mathbf{z}, \mathbf{v}_d\rangle = 0, \langle \mathbf{z}, \mathbf{v}_{d-1}\rangle = 0 \right\}. \]

\(\lhd\)
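
As with \(\lambda_{d-1}\), both orthogonality constraints can be absorbed by restricting to an orthonormal basis of \(\mathrm{span}(\mathbf{v}_{d-1}, \mathbf{v}_d)^\perp\). A minimal numerical sketch (arbitrary test matrix, QR completion as before):

```python
import numpy as np

rng = np.random.default_rng(4)
G = rng.standard_normal((6, 6))
A = (G + G.T) / 2
lams, V = np.linalg.eigh(A)              # increasing: lams[2] = lambda_{d-2}
v_d, v_dm1 = V[:, 0], V[:, 1]

# Orthonormal basis W of span(v_{d-1}, v_d)-perp via QR completion
M = np.column_stack([v_d, v_dm1, rng.standard_normal((6, 4))])
Q, _ = np.linalg.qr(M)
W = Q[:, 2:]

# Minimizing <z, A z> over unit z orthogonal to both v_d and v_{d-1} is the
# smallest eigenvalue of the restricted matrix W^T A W
print(np.linalg.eigvalsh(W.T @ A @ W)[0], lams[2])   # both equal lambda_{d-2}
```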

We now give a proof of Courant-Fischer. The local formulas follow from the same argument used to derive the special cases above. The global formulas require a new idea.

Proof: We first prove the local formulas, that is, the ones involving a specific spectral decomposition.

Local formulas: Since \(\mathbf{v}_1, \ldots, \mathbf{v}_k\) form an orthonormal basis of \(\mathcal{V}_k\), any nonzero vector \(\mathbf{u} \in \mathcal{V}_k\) can be written as \(\mathbf{u} = \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) and it follows that

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=1}^k \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^k \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \geq \lambda_k \frac{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_k \]

where we used \(\lambda_1 \geq \cdots \geq \lambda_k\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_k) = \lambda_k\). So we have established

\[ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}). \]

The expression in terms of \(\mathcal{W}_{d-k+1}\) is proved similarly.

Global formulas: Since \(\mathcal{V}_k\) has dimension \(k\), it follows from the local formula that

\[ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}) \leq \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). \]

Let \(\mathcal{V}\) be any subspace with dimension \(k\). Because \(\mathcal{W}_{d-k+1}\) has dimension \(d - k + 1\), we have that \(\mathrm{dim}(\mathcal{V}) + \mathrm{dim}(\mathcal{W}_{d-k+1}) > d\), so there must be a nonzero vector \(\mathbf{u}_0\) in the intersection \(\mathcal{V} \cap \mathcal{W}_{d-k+1}\) (Prove it!).

We then have by the other local formula that

\[ \lambda_k = \max_{\mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}) \geq \mathcal{R}_A(\mathbf{u}_0) \geq \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). \]

Since this inequality holds for any subspace of dimension \(k\), we have

\[ \lambda_k \geq \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). \]

Combining with the inequality in the other direction above gives the claim. The other global formula is proved similarly. \(\square\)