\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

5.3. Variational characterization of eigenvalues#

In this section, we characterize eigenvalues in terms of certain optimization problems, an idea we have already encountered in the closely related context of the SVD. Such variational characterizations are very useful in applications, as we will see later in the chapter. We first give a proof of the spectral theorem where this idea is used explicitly.

5.3.1. Proof of Spectral Theorem#

We will need the following block matrix formula. Suppose

\[\begin{split} A = \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} B_{11} & B_{12} \end{pmatrix} \end{split}\]

where \(A_{11} \in \mathbb{R}^{n_1\times p}\), \(A_{21} \in \mathbb{R}^{n_2\times p}\), \(B_{11} \in \mathbb{R}^{p\times n_1}\), \(B_{12} \in \mathbb{R}^{p\times n_2}\), then

\[\begin{split} A B = \begin{pmatrix} A_{11} B_{11} & A_{11} B_{12} \\ A_{21} B_{11} & A_{21} B_{12} \end{pmatrix}. \end{split}\]

Indeed this is just a special case of the \(2 \times 2\) block matrix product formula we have encountered previously, with empty blocks \(A_{12}, A_{22}, B_{21}, B_{22}\).
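A quick numerical sanity check of this formula; a minimal sketch assuming NumPy, with block sizes \(n_1, n_2, p\) chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 3, 2, 4

A11 = rng.standard_normal((n1, p))
A21 = rng.standard_normal((n2, p))
B11 = rng.standard_normal((p, n1))
B12 = rng.standard_normal((p, n2))

A = np.vstack([A11, A21])   # stack the two row blocks of A
B = np.hstack([B11, B12])   # stack the two column blocks of B

# assemble the claimed right-hand side block by block
blockwise = np.block([[A11 @ B11, A11 @ B12],
                      [A21 @ B11, A21 @ B12]])

print(np.allclose(A @ B, blockwise))  # True
```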

Proof idea (Spectral Theorem): Similarly to how we used Householder transformations to “add zeros under the diagonal”, here we will use a sequence of orthogonal transformations to add zeros both below and above the diagonal. Recall that two matrices \(C, D \in \mathbb{R}^{d \times d}\) are similar if there is an invertible matrix \(P\) such that \(C = P^{-1} D P\), and that, when \(P = W\) is an orthogonal matrix, the transformation simplifies to \(C = W^T D W\).

We construct a sequence of orthogonal matrices \(W_1,\ldots, W_d\) and progressively compute \(W_1^T A W_1\), \(W_2^T W_1^T A W_1 W_2\), and so on, in such a way that

\[ \Lambda = W_d^T \cdots W_2^T W_1^T A W_1 W_2 \cdots W_d \]

is diagonal. Then the matrix \(Q\) is simply \(W_1 W_2 \cdots W_d\) (check it!).

To define these matrices, we use a greedy sequence maximizing the quadratic form \(\langle \mathbf{v}, A \mathbf{v}\rangle\). How is this quadratic form related to eigenvalues? Note that, for a unit eigenvector \(\mathbf{v}\) with eigenvalue \(\lambda\), we have \(\langle \mathbf{v}, A \mathbf{v}\rangle = \langle \mathbf{v}, \lambda \mathbf{v}\rangle = \lambda\).

Proof: (Spectral Theorem) \(\idx{spectral theorem}\xdi\) We proceed by induction. We have already encountered the idea that eigenvectors are solutions of a constrained optimization problem; the proof makes explicit use of that idea.

A first eigenvector: Let \(A_1 = A\). Define

\[ \mathbf{v}_1 \in \arg\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\} \]

which exists by the Extreme Value Theorem, and further define

\[ \lambda_1 = \max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}. \]

Complete \(\mathbf{v}_1\) into an orthonormal basis of \(\mathbb{R}^d\), \(\mathbf{v}_1\), \(\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d\), and form the block matrix

\[ \hat{W}_1 = \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix} \]

where the columns of \(\hat{V}_1\) are \(\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d\). Note that \(\hat{W}_1\) is orthogonal by construction.

Getting one step closer to diagonalization: We show next that \(\hat{W}_1\) gets us one step closer to a diagonal matrix via a similarity transformation. Note first that

\[\begin{align*} \hat{W}_1^T A_1 \hat{W}_1 &= \begin{pmatrix} \mathbf{v}_1^T \\ \hat{V}_1^T \end{pmatrix} A_1 \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix}\\ &= \begin{pmatrix} \mathbf{v}_1^T \\ \hat{V}_1^T \end{pmatrix} \begin{pmatrix} A_1 \mathbf{v}_1 & A_1 \hat{V}_1 \end{pmatrix}\\ &= \begin{pmatrix} \mathbf{v}_1^T A_1 \mathbf{v}_1 & \mathbf{v}_1^T A_1 \hat{V}_1\\ \hat{V}_1^T A_1 \mathbf{v}_1 & \hat{V}_1^T A_1 \hat{V}_1 \end{pmatrix}\\ &= \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix} \end{align*}\]

where \(\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1\) and \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\).

The key claim is that \(\mathbf{w}_1 = \mathbf{0}\). We argue by contradiction. Suppose instead that \(\mathbf{w}_1 \neq \mathbf{0}\) and consider the unit vector (check that \(\|\mathbf{z}\| = 1\))

\[\begin{split} \mathbf{z} = \hat{W}_1 \frac{1}{\sqrt{1 + \delta^2 \|\mathbf{w}_1\|^2}} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix} \end{split}\]

which, by the block formula above, achieves the objective value

\[\begin{align*} \mathbf{z}^T A_1 \mathbf{z} &= \frac{1}{1 + \delta^2 \|\mathbf{w}_1\|^2} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix}^T \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix}\\ &= \frac{1}{1 + \delta^2 \|\mathbf{w}_1\|^2} \left( \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \mathbf{w}_1^T A_2 \mathbf{w}_1 \right). \end{align*}\]

By the sum of a geometric series, for \(\varepsilon \in (0,1)\),

\[ \frac{1}{1 + \varepsilon^2} = 1 - \varepsilon^2 + \varepsilon^4 - \cdots. \]

So, for \(\delta > 0\) small enough,

\[\begin{align*} \mathbf{z}^T A_1 \mathbf{z} &\approx (\lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \mathbf{w}_1^T A_2 \mathbf{w}_1) (1 - \delta^2 \|\mathbf{w}_1\|^2)\\ &\approx \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + C \delta^2\\ &> \lambda_1 \end{align*}\]

where \(C \in \mathbb{R}\) depends on \(\mathbf{w}_1\) and \(A_2\), and where we ignored the “higher-order” terms involving \(\delta^3, \delta^4, \delta^5, \ldots\) whose overall contribution is negligible when \(\delta\) is small. That gives the desired contradiction – that is, it would imply the existence of a vector achieving a strictly better objective value than the optimal \(\mathbf{v}_1\).

So, letting \(W_1 = \hat{W}_1\),

\[\begin{split} W_1^T A_1 W_1 = \begin{pmatrix} \lambda_1 & \mathbf{0} \\ \mathbf{0} & A_2 \end{pmatrix}. \end{split}\]

Finally note that \(A_2 = \hat{V}_1^T A_1 \hat{V}_1\) is symmetric

\[ A_2^T = (\hat{V}_1^T A_1 \hat{V}_1)^T = \hat{V}_1^T A_1^T \hat{V}_1 = \hat{V}_1^T A_1 \hat{V}_1 = A_2 \]

by the symmetry of \(A_1\) itself.

Next step of the induction: Apply the same argument to the symmetric matrix \(A_2 \in \mathbb{R}^{(d-1)\times (d-1)}\), let \(\hat{W}_2 = (\mathbf{v}_2\ \hat{V}_2) \in \mathbb{R}^{(d-1)\times (d-1)}\) be the corresponding orthogonal matrix, and define \(\lambda_2\) and \(A_3\) through the equation

\[\begin{split} \hat{W}_2^T A_2 \hat{W}_2 = \begin{pmatrix} \lambda_2 & \mathbf{0} \\ \mathbf{0} & A_3 \end{pmatrix}. \end{split}\]

Now define the block matrix

\[\begin{split} W_2 = \begin{pmatrix} 1 & \mathbf{0}\\ \mathbf{0} & \hat{W}_2 \end{pmatrix}. \end{split}\]

Observe that (check it!)

\[\begin{split} W_2^T W_1^T A_1 W_1 W_2 = W_2^T \begin{pmatrix} \lambda_1 & \mathbf{0} \\ \mathbf{0} & A_2 \end{pmatrix} W_2 = \begin{pmatrix} \lambda_1 & \mathbf{0}\\ \mathbf{0} & \hat{W}_2^T A_2 \hat{W}_2 \end{pmatrix} =\begin{pmatrix} \lambda_1 & 0 & \mathbf{0} \\ 0 & \lambda_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & A_3 \end{pmatrix}. \end{split}\]

Proceeding similarly by induction gives the claim. \(\square\)

A couple of remarks about the proof:

1- The fact that \(\mathbf{w}_1 = \mathbf{0}\) is perhaps more intuitively understood through vector calculus: applying the method of Lagrange multipliers to \(\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}\) shows that \(A_1 \mathbf{v}_1\) must be proportional to \(\mathbf{v}_1\). Hence \(\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1 = \mathbf{0}\), since the columns of \(\hat{V}_1\) are orthogonal to \(\mathbf{v}_1\) by construction. Indeed, define the Lagrangian function

\[ L(\mathbf{v}, \lambda) = \langle \mathbf{v}, A_1 \mathbf{v}\rangle - \lambda(\|\mathbf{v}\|^2 - 1). \]

The first-order necessary conditions for a local maximizer \(\mathbf{v}_1\) are

\[ \nabla_{\mathbf{v}} L(\mathbf{v}_1, \lambda_1) = 2A_1 \mathbf{v}_1 - 2\lambda_1 \mathbf{v}_1 = \mathbf{0} \]
\[ \nabla_{\lambda} L(\mathbf{v}_1, \lambda_1) = \|\mathbf{v}_1\|^2 - 1 = 0. \]

From the first condition, we have

\[ A_1 \mathbf{v}_1 = \lambda_1 \mathbf{v}_1. \]

This shows that \(A_1 \mathbf{v}_1\) is proportional to \(\mathbf{v}_1\) as claimed.

2- By construction, the vector \(\mathbf{v}_2\) (i.e., the first column of \(\hat{W}_2\)) is a solution to

\[ \mathbf{v}_2 \in \arg\max\{\langle \mathbf{v}, A_2 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}. \]

Note that, by definition of \(A_2\) (and the fact that \(A_1 = A\)),

\[ \mathbf{v}^T A_2 \mathbf{v} = \mathbf{v}^T \hat{V}_1^T A_1 \hat{V}_1 \mathbf{v} = (\hat{V}_1 \mathbf{v})^T \,A \,(\hat{V}_1 \mathbf{v}). \]

So we can think of the solution \(\mathbf{v}_2\) as specifying an optimal linear combination of the columns of \(\hat{V}_1\) – which form a basis of the space \(\mathrm{span}(\mathbf{v}_1)^\perp\) of vectors orthogonal to \(\mathbf{v}_1\). In essence \(\mathbf{v}_2\) solves the same problem as \(\mathbf{v}_1\), but restricted to \(\mathrm{span}(\mathbf{v}_1)^\perp\). We will come back to this below.
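To make the construction concrete, here is a minimal NumPy sketch of the proof's greedy recursion. It is only an illustration under simplifying assumptions: the maximizer at each step is taken from `numpy.linalg.eigh` rather than obtained by solving the constrained optimization directly, the orthonormal completion is done via QR on random vectors, and the helper name `greedy_diagonalize` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_diagonalize(A):
    """Sketch of the proof's construction: at each step, take a top eigenvector
    of the current deflated block, complete it into an orthonormal basis, and
    recurse on the lower-right block."""
    d = A.shape[0]
    Q = np.eye(d)
    lam = np.zeros(d)
    A_k = A.copy()
    for k in range(d):
        m = A_k.shape[0]
        # v maximizes <v, A_k v> over unit vectors, i.e. it is a top eigenvector;
        # for simplicity we take it from NumPy's symmetric eigensolver
        w, V = np.linalg.eigh(A_k)
        v, lam[k] = V[:, -1], w[-1]
        # complete v into an orthonormal basis W_hat = (v | V_hat) via QR
        M = np.column_stack([v, rng.standard_normal((m, m - 1))])
        W_hat, _ = np.linalg.qr(M)
        W_hat[:, 0] = v              # QR may flip the sign; enforce first column = v
        V_hat = W_hat[:, 1:]
        A_k = V_hat.T @ A_k @ V_hat  # the deflated (m-1) x (m-1) symmetric block
        # embed W_hat as the k-th orthogonal factor and accumulate Q = W_1 ... W_d
        W = np.eye(d)
        W[k:, k:] = W_hat
        Q = Q @ W
    return Q, lam

B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                    # a random symmetric matrix
Q, lam = greedy_diagonalize(A)
print(np.allclose(Q.T @ A @ Q, np.diag(lam)))  # True: Q^T A Q is diagonal
```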

5.3.2. Variational characterization: special cases#

We begin with a definition.

DEFINITION (Rayleigh Quotient) \(\idx{Rayleigh quotient}\xdi\) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix. The Rayleigh quotient is defined as

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} \]

which is defined for any \(\mathbf{u} \neq \mathbf{0}\) in \(\mathbb{R}^{d}\). \(\natural\)

To start seeing the connection to the spectral decomposition, let \(\mathbf{v}\) be a (not necessarily unit) eigenvector of \(A\) with eigenvalue \(\lambda\). One can show that \(\mathcal{R}_A(\mathbf{v}) = \lambda\). (Try it!) In fact, recall that we have previously shown that eigenvectors of \(A\) are stationary points of \(\mathcal{R}_A\).
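In code, the definition is immediate; here is a minimal NumPy sketch (the helper name `rayleigh_quotient` is ours) that also verifies the claim \(\mathcal{R}_A(\mathbf{v}) = \lambda\) on a non-unit eigenvector:

```python
import numpy as np

def rayleigh_quotient(A, u):
    """Rayleigh quotient R_A(u) = <u, A u> / <u, u>, defined for u != 0."""
    return (u @ A @ u) / (u @ u)

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # a random symmetric matrix

w, V = np.linalg.eigh(A)               # eigenvalues in nondecreasing order
v = 3.0 * V[:, 0]                      # a non-unit eigenvector with eigenvalue w[0]
print(np.isclose(rayleigh_quotient(A, v), w[0]))  # True: R_A(v) = lambda
```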

Before stating a general variational characterization, we prove a few special cases. Throughout, let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\).

The largest eigenvalue: Since \(\mathbf{v}_1, \ldots, \mathbf{v}_d\) forms an orthonormal basis of \(\mathbb{R}^d\), any nonzero vector \(\mathbf{u}\) can be written as \(\mathbf{u} = \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) and it follows that, by the Properties of Orthonormal Lists and the bilinearity of the inner product,

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \|\mathbf{u}\|^2 = \left\|\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\right\|^2 = \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle A \mathbf{v}_i \right\rangle = \left\langle \mathbf{u}, \sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \leq \lambda_1 \frac{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_1, \]

where we used \(\lambda_1 \geq \cdots \geq \lambda_d\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_1) = \lambda_1\). So we have established

\[ \lambda_1 = \max_{\mathbf{u} \neq \mathbf{0}} \mathcal{R}_A(\mathbf{u}). \]

The smallest eigenvalue: Arguing in the opposite direction, we get a characterization of the smallest eigenvalue. Using the same notation as before, we have

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^d \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \geq \lambda_d \frac{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^d \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_d, \]

where, again, we used \(\lambda_1 \geq \cdots \geq \lambda_d\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_d) = \lambda_d\). So we have established

\[ \lambda_d = \min_{\mathbf{u} \neq \mathbf{0}} \mathcal{R}_A(\mathbf{u}). \]
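Here is a small numerical illustration of both characterizations, assuming NumPy (note that `numpy.linalg.eigh` returns eigenvalues in nondecreasing order, so `w[-1]` plays the role of \(\lambda_1\) and `w[0]` that of \(\lambda_d\)): Rayleigh quotients of random nonzero vectors stay within \([\lambda_d, \lambda_1]\), and the two bounds are attained at the corresponding eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                      # a random symmetric matrix
w, V = np.linalg.eigh(A)               # w[0] <= ... <= w[-1]

def rayleigh_quotient(A, u):
    return (u @ A @ u) / (u @ u)

# Rayleigh quotients of random nonzero vectors lie in [lambda_d, lambda_1] ...
samples = [rayleigh_quotient(A, rng.standard_normal(6)) for _ in range(10_000)]
print(w[0] <= min(samples) and max(samples) <= w[-1])      # True

# ... and the extremes are attained at the corresponding eigenvectors
print(np.isclose(rayleigh_quotient(A, V[:, -1]), w[-1]),   # largest eigenvalue
      np.isclose(rayleigh_quotient(A, V[:, 0]), w[0]))     # smallest eigenvalue
```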

The second smallest eigenvalue: To pick out the second smallest eigenvalue, we argue as above but restrict the optimization to the space \(\mathcal{V}_{d-1} = \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1})\). Indeed, if \(\mathbf{u}\) is in the linear subspace \(\mathcal{V}_{d-1}\), it can be written as \(\mathbf{u} = \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) (since \(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1}\) forms an orthonormal basis of it; why?) and it follows that

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=1}^{d-1} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^{d-1} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \geq \lambda_{d-1} \frac{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^{d-1} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_{d-1}, \]

where we used \(\lambda_1 \geq \cdots \geq \lambda_{d-1}\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_{d-1}) = \lambda_{d-1}\) and of course \(\mathbf{v}_{d-1} \in \mathcal{V}_{d-1}\). So we have established

\[ \lambda_{d-1} = \min_{\mathbf{0} \neq \mathbf{u} \in \mathcal{V}_{d-1}} \mathcal{R}_A(\mathbf{u}). \]

Now, what is \(\mathcal{V}_{d-1}\)? It is spanned by the orthonormal list \(\mathbf{v}_1,\ldots,\mathbf{v}_{d-1}\), each of which is orthogonal to \(\mathbf{v}_d\). So (why?)

\[ \mathcal{V}_{d-1} = \mathrm{span}(\mathbf{v}_d)^\perp. \]

So, equivalently,

\[ \lambda_{d-1} = \min\left\{\mathcal{R}_A(\mathbf{u})\,:\ \mathbf{u} \neq \mathbf{0}, \langle \mathbf{u}, \mathbf{v}_d\rangle = 0 \right\}. \]

In fact, for any \(\mathbf{u} \neq \mathbf{0}\), we can normalize it by defining \(\mathbf{z} = \mathbf{u}/\|\mathbf{u}\|\) and we note that

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u}\rangle}{\langle \mathbf{u},\mathbf{u}\rangle} = \frac{\langle \mathbf{u}, A \mathbf{u}\rangle}{\|\mathbf{u}\|^2} = \left\langle \frac{\mathbf{u}}{\|\mathbf{u}\|}, A \frac{\mathbf{u}}{\|\mathbf{u}\|}\right\rangle = \langle \mathbf{z}, A \mathbf{z}\rangle. \]

So,

\[ \lambda_{d-1} = \min\left\{\langle \mathbf{z}, A \mathbf{z}\rangle\,:\ \|\mathbf{z}\|=1, \langle \mathbf{z}, \mathbf{v}_d\rangle = 0 \right\}. \]
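This constrained formulation is easy to probe numerically. In the sketch below (assuming NumPy), a unit vector \(\mathbf{z}\) with \(\langle \mathbf{z}, \mathbf{v}_d\rangle = 0\) is written as \(\mathbf{z} = P\mathbf{y}\) with \(\|\mathbf{y}\| = 1\), where the columns of \(P\) form an orthonormal basis of \(\mathrm{span}(\mathbf{v}_d)^\perp\); the constrained minimum is then the smallest eigenvalue of \(P^T A P\).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
B = rng.standard_normal((d, d))
A = (B + B.T) / 2
w, V = np.linalg.eigh(A)          # ascending: w[0] = lambda_d, w[1] = lambda_{d-1}

v_d = V[:, 0]                     # unit eigenvector of the smallest eigenvalue
# orthonormal basis of span(v_d)^perp, built without using the other
# eigenvectors: QR of [v_d | random vectors], then drop the first column
M = np.column_stack([v_d, rng.standard_normal((d, d - 1))])
Q, _ = np.linalg.qr(M)
P = Q[:, 1:]                      # columns span span(v_d)^perp

# a unit z with <z, v_d> = 0 is z = P y with ||y|| = 1, so the constrained
# minimum of <z, A z> is the smallest eigenvalue of P^T A P
restricted_min = np.linalg.eigvalsh(P.T @ A @ P)[0]
print(np.isclose(restricted_min, w[1]))   # True: equals lambda_{d-1}
```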

5.3.3. General statement: Courant-Fischer#

Before stating a general result, we give one more example.

The second smallest eigenvalue (take two): Interestingly, there is a second characterization of the second smallest eigenvalue. Indeed, suppose we restrict the optimization to the space \(\mathcal{W}_{2} = \mathrm{span}(\mathbf{v}_{d-1},\mathbf{v}_d)\) instead. If \(\mathbf{u}\) is in the linear subspace \(\mathcal{W}_{2}\), it can be written as \(\mathbf{u} = \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i\) (since \(\mathbf{v}_{d-1},\mathbf{v}_{d}\) forms an orthonormal basis of it) and it follows that

\[ \langle \mathbf{u}, \mathbf{u} \rangle = \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2 \]
\[ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=d-1}^{d} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. \]

Thus,

\[ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=d-1}^{d} \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \leq \lambda_{d-1} \frac{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=d-1}^{d} \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_{d-1}, \]

where we used \(\lambda_{d-1} \geq \lambda_{d}\) and the fact that \(\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0\). Moreover \(\mathcal{R}_A(\mathbf{v}_{d-1}) = \lambda_{d-1}\). So we have established

\[ \lambda_{d-1} = \max_{\mathbf{0} \neq \mathbf{u} \in \mathcal{W}_{2}} \mathcal{R}_A(\mathbf{u}). \]

THEOREM (Courant-Fischer) \(\idx{Courant-Fischer theorem}\xdi\) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\). For each \(k = 1,\ldots,d\), define the subspace

\[ \mathcal{V}_k = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) \quad\text{and}\quad \mathcal{W}_{d-k+1} = \mathrm{span}(\mathbf{v}_k, \ldots, \mathbf{v}_d). \]

Then, for all \(k = 1,\ldots,d\),

\[ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}) = \max_{\mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}), \]

which are referred to as the local formulas. Furthermore we have the following min-max (or global) formulas, which do not depend on the choice of a spectral decomposition: for all \(k = 1,\ldots,d\),

\[ \lambda_k = \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}) = \min_{\mathrm{dim}(\mathcal{W}) = d-k+1} \max_{\mathbf{u} \in \mathcal{W}} \mathcal{R}_A(\mathbf{u}). \]

\(\sharp\)

Proof idea: For the local formula, we expand a vector in \(\mathcal{V}_k\) into the basis \(\mathbf{v}_1,\ldots,\mathbf{v}_k\) and use the fact that \(\mathcal{R}_A(\mathbf{v}_i) = \lambda_i\) and that eigenvalues are in non-increasing order. The global formulas follow from a dimension argument.

EXAMPLE: (Third Smallest Eigenvalue) The Courant-Fischer theorem can be used to recover the special cases of the previous subsection. One also gets new cases of interest. We give a characterization of the third smallest eigenvalue next. Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix with spectral decomposition \(A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T\) where \(\lambda_1 \geq \cdots \geq \lambda_d\). Using \(k=d-2\) in the Courant-Fischer theorem gives

\[ \lambda_{d-2} = \min_{\mathbf{u} \in \mathcal{V}_{d-2}} \mathcal{R}_A(\mathbf{u}), \]

where

\[ \mathcal{V}_{d-2} = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{d-2}). \]

It can be seen that

\[ \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_{d-2}) = \mathrm{span}(\mathbf{v}_{d-1}, \mathbf{v}_{d})^\perp. \]

So we finally arrive at the following characterization of the third smallest eigenvalue

\[ \lambda_{d-2} = \min\{\mathcal{R}_A(\mathbf{u})\,:\, \mathbf{0} \neq \mathbf{u}, \langle \mathbf{u},\mathbf{v}_d\rangle = 0, \langle \mathbf{u},\mathbf{v}_{d-1}\rangle = 0\}. \]

Using the same argument as we used for \(\lambda_{d-1}\), we get also

\[ \lambda_{d-2} = \min\left\{\langle \mathbf{z}, A \mathbf{z}\rangle\,:\ \|\mathbf{z}\|=1, \langle \mathbf{z}, \mathbf{v}_d\rangle = 0, \langle \mathbf{z}, \mathbf{v}_{d-1}\rangle = 0 \right\}. \]

\(\lhd\)
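The same pattern verifies this characterization numerically: with two orthogonality constraints we restrict to \(\mathrm{span}(\mathbf{v}_{d-1}, \mathbf{v}_d)^\perp\) and take the smallest eigenvalue of the restricted matrix. A short sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 7
B = rng.standard_normal((d, d))
A = (B + B.T) / 2
w, V = np.linalg.eigh(A)          # ascending: w[2] = lambda_{d-2}

# orthonormal basis of span(v_{d-1}, v_d)^perp: QR of [v_d | v_{d-1} | random],
# then drop the first two columns (which span span(v_{d-1}, v_d))
M = np.column_stack([V[:, 0], V[:, 1], rng.standard_normal((d, d - 2))])
Q, _ = np.linalg.qr(M)
P = Q[:, 2:]

# constrained minimum of <z, A z> over unit z orthogonal to v_{d-1} and v_d
print(np.isclose(np.linalg.eigvalsh(P.T @ A @ P)[0], w[2]))   # True: lambda_{d-2}
```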

We now give a proof of the Courant-Fischer theorem. The local formulas follow from the same argument used to derive the special cases above, so we omit the general proof (but try to prove it!). The global formulas require a new idea.

Proof: (Courant-Fischer) \(\idx{Courant-Fischer theorem}\xdi\) Since \(\mathcal{V}_k\) has dimension \(k\), it follows from the local formula that

\[ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}) \leq \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). \]

Let \(\mathcal{V}\) be any subspace of dimension \(k\). Because \(\mathcal{W}_{d-k+1}\) has dimension \(d - k + 1\), we have \(\dim(\mathcal{V}) + \mathrm{dim}(\mathcal{W}_{d-k+1}) = d + 1 > d\), so there must be a non-zero vector \(\mathbf{u}_0\) in the intersection \(\mathcal{V} \cap \mathcal{W}_{d-k+1}\) (Prove it!).

We then have by the other local formula that

\[ \lambda_k = \max_{\mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}) \geq \mathcal{R}_A(\mathbf{u}_0) \geq \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). \]

Since this inequality holds for any subspace of dimension \(k\), we have

\[ \lambda_k \geq \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). \]

Combining with the inequality in the other direction above gives the claim. The other global formula is proved similarly. \(\square\)
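The global max-min formula can also be probed numerically. In the sketch below (assuming NumPy; the helper name `min_over_subspace` is ours), the inner minimum over a subspace with an orthonormal basis stored as matrix columns is computed as the smallest eigenvalue of the restricted matrix; random \(k\)-dimensional subspaces never beat \(\lambda_k\), and equality is attained at \(\mathcal{V}_k\).

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 8, 3
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
w, V = np.linalg.eigh(A)           # ascending order
lambda_k = w[::-1][k - 1]          # k-th largest eigenvalue

def min_over_subspace(A, basis):
    # min of R_A over the span of the orthonormal columns of `basis`
    # equals the smallest eigenvalue of basis^T A basis
    return np.linalg.eigvalsh(basis.T @ A @ basis)[0]

# over random k-dimensional subspaces the inner minimum never exceeds lambda_k ...
rand_vals = []
for _ in range(2_000):
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random k-dim subspace
    rand_vals.append(min_over_subspace(A, Q))
print(max(rand_vals) <= lambda_k + 1e-12)              # True

# ... and the maximum is attained at V_k = span(v_1, ..., v_k)
V_k = V[:, ::-1][:, :k]            # top-k eigenvectors (descending order)
print(np.isclose(min_over_subspace(A, V_k), lambda_k)) # True
```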

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 According to the variational characterization of the largest eigenvalue \(\lambda_1\) of a symmetric matrix \(A\), which of the following is true?

a) \(\lambda_1 = \min_{u \neq 0} R_A(u)\)

b) \(\lambda_1 = \max_{u \neq 0} R_A(u)\)

c) \(\lambda_1 = \min_{\|u\| = 1} R_A(u)\)

d) \(\lambda_1 = \max_{\|u\| = 1} R_A(u)\)

2 Let \(V_{d-1} = \mathrm{span}(v_1, \ldots, v_{d-1})\), where \(v_1, \ldots, v_d\) are the eigenvectors of a symmetric matrix \(A\) with eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_d\). Which of the following characterizes the second smallest eigenvalue \(\lambda_{d-1}\)?

a) \(\lambda_{d-1} = \min_{0 \neq u \in V_{d-1}} R_A(u)\)

b) \(\lambda_{d-1} = \max_{0 \neq u \in V_{d-1}} R_A(u)\)

c) \(\lambda_{d-1} = \min_{\|u\| = 1, u \in V_{d-1}} R_A(u)\)

d) \(\lambda_{d-1} = \max_{\|u\| = 1, u \in V_{d-1}} R_A(u)\)

3 Let \(W_2 = \mathrm{span}(v_{d-1}, v_d)\), where \(v_1, \ldots, v_d\) are the eigenvectors of a symmetric matrix \(A\) with eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_d\). Which of the following characterizes the second smallest eigenvalue \(\lambda_{d-1}\)?

a) \(\lambda_{d-1} = \min_{0 \neq u \in W_2} R_A(u)\)

b) \(\lambda_{d-1} = \max_{0 \neq u \in W_2} R_A(u)\)

c) \(\lambda_{d-1} = \min_{\|u\| = 1, u \in W_2} R_A(u)\)

d) \(\lambda_{d-1} = \max_{\|u\| = 1, u \in W_2} R_A(u)\)

4 According to the Courant-Fischer Theorem, which of the following is the global formula for the \(k\)-th eigenvalue \(\lambda_k\) of a symmetric matrix \(A\)?

a) \(\lambda_k = \min_{u \in V_k} R_A(u) = \max_{u \in W_{d-k+1}} R_A(u)\)

b) \(\lambda_k = \max_{u \in V_k} R_A(u) = \min_{u \in W_{d-k+1}} R_A(u)\)

c) \(\lambda_k = \max_{\mathrm{dim}(V) = k} \min_{u \in V} R_A(u) = \min_{\mathrm{dim}(W) = d-k+1} \max_{u \in W} R_A(u)\)

d) \(\lambda_k = \max_{\mathrm{dim}(V) = k} \max_{u \in V} R_A(u) = \min_{\mathrm{dim}(W) = d-k+1} \min_{u \in W} R_A(u)\)

5 What is the main difference between the local and global formulas in the Courant-Fischer theorem?

a) The local formulas are easier to compute than the global formulas.

b) The local formulas depend on a specific choice of eigenvectors, while the global formulas do not.

c) The local formulas apply only to symmetric matrices, while the global formulas apply to any matrix.

d) The local formulas provide upper bounds on the eigenvalues, while the global formulas provide lower bounds.

Answer for 1: b. Justification: The text establishes that \(\lambda_1 = \max_{u \neq 0} R_A(u)\).

Answer for 2: a. Justification: The text establishes that \(\lambda_{d-1} = \min_{0 \neq u \in V_{d-1}} R_A(u)\).

Answer for 3: b. Justification: The text establishes that \(\lambda_{d-1} = \max_{0 \neq u \in W_2} R_A(u)\).

Answer for 4: c. Justification: The Courant-Fischer Theorem states that the global (min-max) formula for the \(k\)-th eigenvalue is \(\lambda_k = \max_{\mathrm{dim}(V) = k} \min_{u \in V} R_A(u) = \min_{\mathrm{dim}(W) = d-k+1} \max_{u \in W} R_A(u)\).

Answer for 5: b. Justification: The text highlights that the global formulas “do not depend on the choice of spectral decomposition,” unlike the local formulas, which rely on a specific set of eigenvectors.