
\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

4.2. Background: review of matrix rank and spectral decomposition#

We will need two additional concepts from linear algebra: the rank of a matrix and the spectral theorem.

4.2.1. Rank of a matrix#

Recall that the dimension of the column space of \(A\) is called the column rank of \(A\). Similarly the row rank of \(A\) is the dimension of its row space. As it turns out, these two notions of rank are the same by the Row Rank Equals Column Rank Theorem\(\idx{row rank equals column rank theorem}\xdi\). We give a proof of that theorem next. From now on, we refer to the row rank and column rank of \(A\) simply as the rank, which we denote by \(\mathrm{rk}(A)\). \(\idx{rank}\xdi\)

Proof idea: (Row Rank Equals Column Rank) Write \(A\) as a matrix factorization \(BC\) where the columns of \(B\) form a basis of \(\mathrm{col}(A)\). Then the rows of \(C\) necessarily form a spanning set of \(\mathrm{row}(A)\). Because the number of columns of \(B\) and the number of rows of \(C\) match, we conclude that the row rank is less than or equal to the column rank. Applying the same argument to the transpose gives the claim.

Recall the following observations from a previous chapter.

Observation D1: Any linear subspace \(U\) of \(\mathbb{R}^n\) has a basis.

Observation D2: If \(U\) and \(V\) are linear subspaces such that \(U \subseteq V\), then \(\mathrm{dim}(U) \leq \mathrm{dim}(V)\).

Observation D3: The dimension of \(\mathrm{span}(\mathbf{u}_1,\ldots,\mathbf{u}_m)\) is at most \(m\).

Proof: (Row Rank Equals Column Rank) Assume that \(A\) has column rank \(r\). Then there exists a basis \(\mathbf{b}_1,\ldots, \mathbf{b}_r \in \mathbb{R}^n\) of \(\mathrm{col}(A)\) by Observation D1 above, and we know that \(r \leq n\) by Observation D2. That is, for each \(j\), letting \(\mathbf{a}_{j} = A_{\cdot,j}\) be the \(j\)-th column of \(A\) we can write

\[ \mathbf{a}_{j} = \sum_{\ell=1}^r \mathbf{b}_\ell c_{\ell j} \]

for some \(c_{\ell j}\)’s. Let \(B\) be the matrix whose columns are \(\mathbf{b}_1, \ldots, \mathbf{b}_r\) and let \(C\) be the matrix with entries \((C)_{\ell j} = c_{\ell j}\), \(\ell=1,\ldots,r\), \(j=1,\ldots,m\). Then the equation above can be re-written as the matrix factorization \(A = BC\). Indeed, by our previous observations about matrix-matrix products, the columns of \(A\) are linear combinations of the columns of \(B\) with coefficients taken from the corresponding column of \(C\).

The key point is the following: \(C\) necessarily has \(r\) rows. Let \(\boldsymbol{\alpha}_{i}^T = A_{i,\cdot}\) be the \(i\)-th row of \(A\) and \(\mathbf{c}_{\ell}^T = C_{\ell,\cdot}\) be the \(\ell\)-th row of \(C\). Using our alternative representation of matrix-matrix product in terms of rows, the decomposition is equivalent to

\[ \boldsymbol{\alpha}_{i}^T = \sum_{\ell=1}^r b_{i\ell} \mathbf{c}_\ell^T, \quad i=1,\ldots, n, \]

where \(b_{i\ell} = (\mathbf{b}_\ell)_i = (B)_{i\ell}\) is the \(i\)-th entry of the \(\ell\)-th column of \(B\). In words, the rows of \(A\) are linear combinations of the rows of \(C\) with coefficients taken from the corresponding row of \(B\). In particular, \(\mathcal{C} = \{\mathbf{c}_{\ell}:\ell=1,\ldots,r\}\) is a spanning list of the row space of \(A\), that is, each row of \(A\) can be written as a linear combination of the vectors in \(\mathcal{C}\). Put differently, \(\mathrm{row}(A) \subseteq \mathrm{span}(\mathcal{C})\).

So the row rank of \(A\) is at most \(r\), the column rank of \(A\), by Observations D2 and D3.

Applying the same argument to \(A^T\), which switches the roles of the columns and the rows, gives that the column rank of \(A\) (i.e. the row rank of \(A^T\)) is at most the row rank of \(A\) (i.e. the column rank of \(A^T\)). Hence the two notions of rank must be equal. (We also deduce \(r \leq m\) by Observation D2 again.) \(\square\)

EXAMPLE: (continued) We illustrate the proof of the theorem. Continuing a previous example, let \(A\) be the matrix with columns \(\mathbf{w}_1 = (1,0,1)\), \(\mathbf{w}_2 = (0,1,1)\), and \(\mathbf{w}_3 = (1,-1,0)\)

\[\begin{split} A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1\\ 1 & 1 & 0 \end{pmatrix}. \end{split}\]

We know that \(\mathbf{w}_1\) and \(\mathbf{w}_2\) form a basis of \(\mathrm{col}(A)\). We use them to construct our matrix \(B\)

\[\begin{split} B = \begin{pmatrix} 1 & 0\\ 0 & 1\\ 1 & 1 \end{pmatrix}. \end{split}\]

Recall that \(\mathbf{w}_3 = \mathbf{w}_1 - \mathbf{w}_2\). Hence the matrix \(C\) is

\[\begin{split} C = \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & -1 \end{pmatrix}. \end{split}\]

Indeed, column \(j\) of \(C\) gives the coefficients in the linear combination of the columns of \(B\) that produces column \(j\) of \(A\). Check that \(A = B C\). \(\lhd\)

NUMERICAL CORNER: In Numpy, one can compute the rank of a matrix using the function numpy.linalg.matrix_rank. We will see later in the chapter how to compute it using the singular value decomposition (which is how LA.matrix_rank does it). Let’s try the example above.

import numpy as np
from numpy import linalg as LA

w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
w3 = np.array([1., -1., 0.])
A = np.stack((w1, w2, w3), axis=-1)
print(A)
[[ 1.  0.  1.]
 [ 0.  1. -1.]
 [ 1.  1.  0.]]

We compute the rank of A.

LA.matrix_rank(A)
2

We take only the first two columns of A this time to form B.

B = np.stack((w1, w2), axis=-1)
print(B)
[[1. 0.]
 [0. 1.]
 [1. 1.]]
LA.matrix_rank(B)
2

Recall that, in Numpy, @ is used for matrix product.

C = np.array([[1., 0., 1.],[0., 1., -1.]])
print(C)
[[ 1.  0.  1.]
 [ 0.  1. -1.]]
LA.matrix_rank(C)
2
print(B @ C)
[[ 1.  0.  1.]
 [ 0.  1. -1.]
 [ 1.  1.  0.]]

\(\unlhd\)

EXAMPLE: Let \(A \in \mathbb{R}^{n \times k}\) and \(B \in \mathbb{R}^{k \times m}\). Then we claim that

\[ \mathrm{rk}(AB) \leq \mathrm{rk}(A). \]

Indeed, the columns of \(AB\) are linear combinations of the columns of \(A\). Hence \(\mathrm{col}(AB) \subseteq \mathrm{col}(A)\) and the claim follows by Observation D2. The same argument applied to \((AB)^T = B^T A^T\) shows that \(\mathrm{rk}(AB) \leq \mathrm{rk}(B^T) = \mathrm{rk}(B)\), so in fact \(\mathrm{rk}(AB) \leq \min\{\mathrm{rk}(A), \mathrm{rk}(B)\}\). \(\lhd\)

EXAMPLE: Let \(A \in \mathbb{R}^{n \times k}\) and \(B \in \mathbb{R}^{n \times m}\). Then we claim that

\[ \mathrm{rk}(A + B) \leq \mathrm{rk}(A) + \mathrm{rk}(B). \]

Indeed, the columns of \(A + B\) are linear combinations of the columns of \(A\) and of \(B\). Let \(\mathbf{u}_1,\ldots,\mathbf{u}_h\) be a basis of \(\mathrm{col}(A)\) and let \(\mathbf{v}_1,\ldots,\mathbf{v}_{\ell}\) be a basis of \(\mathrm{col}(B)\) by Observation D1. Then, we deduce

\[ \mathrm{col}(A + B) \subseteq \mathrm{span}(\mathbf{u}_1,\ldots,\mathbf{u}_h,\mathbf{v}_1,\ldots,\mathbf{v}_{\ell}). \]

By Observation D2, it follows that

\[ \mathrm{rk}(A + B) \leq \mathrm{dim}(\mathrm{span}(\mathbf{u}_1,\ldots,\mathbf{u}_h,\mathbf{v}_1,\ldots,\mathbf{v}_{\ell})). \]

By Observation D3, the right hand side is at most the length of the spanning list, i.e., \(h+\ell\). But by construction \(\mathrm{rk}(A) = h\) and \(\mathrm{rk}(B) = \ell\), so we are done. \(\lhd\)
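These inequalities are easy to check numerically with numpy.linalg.matrix_rank. Below is a minimal sketch (assuming numpy is imported as np and numpy.linalg as LA, as in the Numerical Corners of this chapter); the random low-rank matrices are arbitrary choices built as products of thin matrices.

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # rank at most 2
N = rng.standard_normal((4, 3))                                 # generically rank 3
K = rng.standard_normal((5, 1)) @ rng.standard_normal((1, 4))   # rank at most 1
print(LA.matrix_rank(M @ N), LA.matrix_rank(M))                      # rk(MN) <= rk(M)
print(LA.matrix_rank(M + K), LA.matrix_rank(M) + LA.matrix_rank(K))  # rk(M+K) <= rk(M) + rk(K)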

EXAMPLE: (A Proof of the Rank-Nullity Theorem) \(\idx{rank-nullity theorem}\xdi\) Let \(A \in \mathbb{R}^{n \times m}\). Recall that the column space of \(A\), \(\mathrm{col}(A) \subseteq \mathbb{R}^n\), is the span of its columns. We compute its orthogonal complement. By definition, the columns of \(A\), which we denote by \(\mathbf{a}_1,\ldots,\mathbf{a}_m\), form a spanning list of \(\mathrm{col}(A)\). So \(\mathbf{u} \in \mathrm{col}(A)^\perp \subseteq \mathbb{R}^n\) if and only if

\[ \mathbf{a}_i^T\mathbf{u} = \langle \mathbf{u}, \mathbf{a}_i \rangle = 0, \quad \forall i=1,\ldots,m. \]

Indeed, the latter condition implies that for any \(\mathbf{v} \in \mathrm{col}(A)\), say \(\mathbf{v} = \beta_1 \mathbf{a}_1 + \cdots + \beta_m \mathbf{a}_m\), we have

\[ \left\langle \mathbf{u}, \sum_{i=1}^m \beta_i \mathbf{a}_i \right\rangle = \sum_{i=1}^m \beta_i \langle \mathbf{u}, \mathbf{a}_i \rangle = 0. \]

The \(m\) conditions above can be written in matrix form as

\[ A^T \mathbf{u} = \mathbf{0}. \]

That is, the orthogonal complement of the column space of \(A\) is the null space of \(A^T\)

\[ \mathrm{col}(A)^\perp = \mathrm{null}(A^T). \]

Applying the same argument to the column space of \(A^T\), it follows that

\[ \mathrm{col}(A^T)^\perp = \mathrm{null}(A), \]

where note that \(\mathrm{null}(A) \subseteq \mathbb{R}^m\). The four linear subspaces \(\mathrm{col}(A)\), \(\mathrm{col}(A^T)\), \(\mathrm{null}(A)\) and \(\mathrm{null}(A^T)\) are referred to as the fundamental subspaces of \(A\). We have shown

\[ \mathrm{col}(A) \oplus \mathrm{null}(A^T) = \mathbb{R}^n \quad \text{and} \quad \mathrm{col}(A^T) \oplus \mathrm{null}(A) = \mathbb{R}^m \]

By the Row Rank Equals Column Rank Theorem, \(\mathrm{dim}(\mathrm{col}(A)) = \mathrm{dim}(\mathrm{col}(A^T))\). Moreover, by our previous observation about the dimensions of direct sums, we have

\[ n = \mathrm{dim}(\mathrm{col}(A)) + \mathrm{dim}(\mathrm{null}(A^T)) = \mathrm{dim}(\mathrm{col}(A^T)) + \mathrm{dim}(\mathrm{null}(A^T)) \]

and

\[ m = \mathrm{dim}(\mathrm{col}(A^T)) + \mathrm{dim}(\mathrm{null}(A)) = \mathrm{dim}(\mathrm{col}(A)) + \mathrm{dim}(\mathrm{null}(A)). \]

So we deduce that

\[ \mathrm{dim}(\mathrm{null}(A)) = m - \mathrm{rk}(A) \]

and

\[ \mathrm{dim}(\mathrm{null}(A^T)) = n - \mathrm{rk}(A). \]

These formulas are referred to as the Rank-Nullity Theorem. The dimension of the null space is sometimes called the nullity. \(\lhd\)
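As a quick numerical illustration, one can check these formulas on the \(3 \times 3\) matrix A from the Numerical Corner above, which has rank \(2\); the null vector used below comes from the relation \(\mathbf{w}_3 = \mathbf{w}_1 - \mathbf{w}_2\).

n, m = A.shape
r = LA.matrix_rank(A)
print(m - r, n - r)             # nullity of A and of A^T: both equal 1
x = np.array([1., -1., -1.])    # w1 - w2 - w3 = 0, so A x = 0
print(A @ x)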

Outer products and rank-one matrices Let \(\mathbf{u} = (u_1,\ldots,u_n)^T \in \mathbb{R}^n\) and \(\mathbf{v} = (v_1,\ldots,v_m)^T \in \mathbb{R}^m\) be two nonzero column vectors. Their outer product\(\idx{outer product}\xdi\) is defined as the matrix

\[\begin{split} \mathbf{u} \mathbf{v}^T = \begin{pmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_m\\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_m\\ \vdots & \vdots & \ddots & \vdots\\ u_n v_1 & u_n v_2 & \cdots & u_n v_m \end{pmatrix} = \begin{pmatrix} | & & | \\ v_{1} \mathbf{u} & \ldots & v_{m} \mathbf{u} \\ | & & | \end{pmatrix}. \end{split}\]

This is not to be confused with the inner product \(\mathbf{u}^T \mathbf{v}\), which requires \(n = m\) and produces a scalar.

If \(\mathbf{u}\) and \(\mathbf{v}\) are nonzero, the matrix \(\mathbf{u} \mathbf{v}^T\) has rank one. Indeed, its columns are all multiples of the same vector \(\mathbf{u}\), so the column space of \(\mathbf{u} \mathbf{v}^T\) is one-dimensional. Conversely, any rank-one matrix can be written in this form: its column space is spanned by a single nonzero vector \(\mathbf{u}\), so each column is a multiple of \(\mathbf{u}\), and the multipliers form the vector \(\mathbf{v}\).

We have seen many different interpretations of matrix-matrix products. Here is yet another one. Let \(A = (a_{ij})_{i,j} \in \mathbb{R}^{n \times k}\) and \(B = (b_{ij})_{i,j} \in \mathbb{R}^{k \times m}\). Denote by \(\mathbf{a}_1,\ldots,\mathbf{a}_k\) the columns of \(A\) and denote by \(\mathbf{b}_1^T,\ldots,\mathbf{b}_k^T\) the rows of \(B\).

Then

\[\begin{align*} A B &= \begin{pmatrix} \sum_{j=1}^k a_{1j} b_{j1} & \sum_{j=1}^k a_{1j} b_{j2} & \cdots & \sum_{j=1}^k a_{1j} b_{jm}\\ \sum_{j=1}^k a_{2j} b_{j1} & \sum_{j=1}^k a_{2j} b_{j2} & \cdots & \sum_{j=1}^k a_{2j} b_{jm}\\ \vdots & \vdots & \ddots & \vdots\\ \sum_{j=1}^k a_{nj} b_{j1} & \sum_{j=1}^k a_{nj} b_{j2} & \cdots & \sum_{j=1}^k a_{nj} b_{jm} \end{pmatrix}\\ &= \sum_{j=1}^k \begin{pmatrix} a_{1j} b_{j1} & a_{1j} b_{j2} & \cdots & a_{1j} b_{jm}\\ a_{2j} b_{j1} & a_{2j} b_{j2} & \cdots & a_{2j} b_{jm}\\ \vdots & \vdots & \ddots & \vdots\\ a_{nj} b_{j1} & a_{nj} b_{j2} & \cdots & a_{nj} b_{jm} \end{pmatrix}\\ &= \sum_{j=1}^k \mathbf{a}_j \mathbf{b}_j^T. \end{align*}\]

In words, the matrix product \(AB\) can be interpreted as a sum of \(k\) rank-one matrices, each of which is the outer product of a column of \(A\) with the corresponding row of \(B\).

Because the rank of a sum is at most the sum of the ranks (as shown in a previous example), it follows that the rank of \(AB\) is at most \(k\). This is consistent with the fact that the rank of a product is at most the minimum of the ranks (also shown in a previous example).
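This expansion is easy to verify numerically with numpy.outer. The sketch below reuses the matrices B (\(3 \times 2\)) and C (\(2 \times 3\)) from the Numerical Corner above.

outer_sum = sum(np.outer(B[:, j], C[j, :]) for j in range(B.shape[1]))
print(np.allclose(outer_sum, B @ C))    # True: B C is a sum of two rank-one matrices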

4.2.2. Eigenvalues and eigenvectors#

Recall the concepts of eigenvalues\(\idx{eigenvalue}\xdi\) and eigenvectors\(\idx{eigenvector}\xdi\). We work on \(\mathbb{R}^d\).

DEFINITION (Eigenvalues and Eigenvectors) Let \(A \in \mathbb{R}^{d \times d}\) be a square matrix. Then \(\lambda \in \mathbb{R}\) is an eigenvalue of \(A\) if there exists a nonzero vector \(\mathbf{x} \neq \mathbf{0}\) such that

\[ A \mathbf{x} = \lambda \mathbf{x}. \]

The vector \(\mathbf{x}\) is referred to as an eigenvector. \(\natural\)

As the next example shows, not every matrix has a (real) eigenvalue.

EXAMPLE: (No Real Eigenvalues) Set \(d = 2\) and let

\[\begin{split} A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}. \end{split}\]

For \(\lambda\) to be an eigenvalue, there must be a nonzero eigenvector \(\mathbf{x} = (x_1, x_2)\) such that

\[ A \mathbf{x} = \lambda \mathbf{x} \]

or put differently \(- x_2 = \lambda x_1\) and \(x_1 = \lambda x_2\). Substituting one equation into the other, it must be that \(- x_2 = \lambda^2 x_2\) and \(x_1 = - \lambda^2 x_1\). Because \(x_1, x_2\) cannot both be \(0\), \(\lambda\) must satisfy the equation \(\lambda^2 = -1\), which has no real solution. \(\lhd\)

In general, \(A \in \mathbb{R}^{d \times d}\) has at most \(d\) distinct eigenvalues.

LEMMA (Number of Eigenvalues) \(\idx{number of eigenvalues lemma}\xdi\) Let \(A \in \mathbb{R}^{d \times d}\) and let \(\lambda_1, \ldots, \lambda_m\) be distinct eigenvalues of \(A\) with corresponding eigenvectors \(\mathbf{x}_1, \ldots, \mathbf{x}_m\). Then \(\mathbf{x}_1, \ldots, \mathbf{x}_m\) are linearly independent. In particular, \(m \leq d\). \(\flat\)

Proof: Assume by contradiction that \(\mathbf{x}_1, \ldots, \mathbf{x}_m\) are linearly dependent. By the Linear Dependence Lemma, there is \(k \leq m\) such that

\[ \mathbf{x}_k \in \mathrm{span}(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}) \]

where \(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}\) are linearly independent. In particular, there are \(a_1, \ldots, a_{k-1}\) such that

\[ \mathbf{x}_k = a_1 \mathbf{x}_1 + \cdots + a_{k-1} \mathbf{x}_{k-1}. \]

Transform the equation above in two ways: (1) multiply both sides by \(\lambda_k\) and (2) apply \(A\). Then subtract the resulting equations. That leads to

\[ \mathbf{0} = a_1 (\lambda_k - \lambda_1) \mathbf{x}_1 + \cdots + a_{k-1} (\lambda_k - \lambda_{k-1}) \mathbf{x}_{k-1}. \]

Because the \(\lambda_i\)’s are distinct and \(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}\) are linearly independent, we must have \(a_1 = \cdots = a_{k-1} = 0\). But that implies that \(\mathbf{x}_k = \mathbf{0}\), a contradiction.

For the second claim, if there were more than \(d\) distinct eigenvalues, then there would be more than \(d\) corresponding linearly independent eigenvectors by the first claim, a contradiction. \(\square\)

EXAMPLE: (Diagonal (and Similar) Matrices) Recall that we use the notation \(\mathrm{diag}(\lambda_1,\ldots,\lambda_d)\) for the diagonal matrix\(\idx{diagonal matrix}\xdi\) with diagonal entries \(\lambda_1,\ldots,\lambda_d\). The upper bound in the Number of Eigenvalues Lemma can be achieved, for instance, by diagonal matrices with distinct diagonal entries \(A = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)\). Each standard basis vector \(\mathbf{e}_i\) is then an eigenvector

\[ A \mathbf{e}_i = \lambda_i \mathbf{e}_i. \]

More generally, let \(A\) be similar to a matrix \(D = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)\) with distinct diagonal entries, that is, there exists a nonsingular matrix \(P\) such that

\[ A = P D P^{-1}. \]

Let \(\mathbf{p}_1, \ldots, \mathbf{p}_d\) be the columns of \(P\). Note that, because the columns of \(P\) form a basis of \(\mathbb{R}^d\), the entries of the vector \(\mathbf{c} = P^{-1} \mathbf{x}\) are the coefficients of the unique linear combination of the \(\mathbf{p}_i\)’s equal to \(\mathbf{x}\). Indeed, \(P \mathbf{c} = \mathbf{x}\). Hence, \(A \mathbf{x}\) can be thought of as: (1) expressing \(\mathbf{x}\) in the basis \(\mathbf{p}_1, \ldots, \mathbf{p}_d\), (2) scaling each coefficient by the corresponding \(\lambda_i\), and (3) recombining with the \(\mathbf{p}_i\)’s. In particular, the \(\mathbf{p}_i\)’s are eigenvectors of \(A\) since, by the above, \(P^{-1} \mathbf{p}_i = \mathbf{e}_i\) and so

\[ A \mathbf{p}_i = P D P^{-1} \mathbf{p}_i = P D \mathbf{e}_i = P (\lambda_i \mathbf{e}_i) = \lambda_i \mathbf{p}_i. \]

\(\lhd\)
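Here is a minimal numerical sketch of this example; the invertible matrix P and the diagonal matrix D below are arbitrary choices for illustration.

P = np.array([[1., 1.], [0., 1.]])    # invertible; columns p_1, p_2
D = np.diag([3., 2.])                 # distinct diagonal entries
A = P @ D @ LA.inv(P)
print(A @ P[:, 0], 3. * P[:, 0])      # A p_1 = 3 p_1
print(A @ P[:, 1], 2. * P[:, 1])      # A p_2 = 2 p_2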

NUMERICAL CORNER: In Numpy, the eigenvalues and eigenvectors of a matrix can be computed using numpy.linalg.eig.

A = np.array([[2.5, -0.5], [-0.5, 2.5]])
w, v = LA.eig(A)
print(w)
print(v)
[3. 2.]
[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]

Above, w are the eigenvalues in an array, whereas the columns of v are the corresponding eigenvectors.
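A quick check that each column of v is indeed an eigenvector with the matching eigenvalue (multiplying v by w scales column \(j\) by \(\lambda_j\)):

print(np.allclose(A @ v, v * w))    # A q_j = lambda_j q_j for each column j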

\(\unlhd\)

Some matrix algebra We will need a few useful observations about matrices.

A (not necessarily square) matrix \(D \in \mathbb{R}^{k \times r}\) is diagonal if its non-diagonal entries are zero. That is, \(i \neq j\) implies that \(D_{ij} =0\). Note that a diagonal matrix is not necessarily square and that the diagonal elements are allowed to be zero.

Multiplying a matrix by a diagonal one has a very specific effect. Let \(A \in \mathbb{R}^{n \times k}\) and \(B \in \mathbb{R}^{r \times m}\). We focus here on the case \(k \geq r\). The matrix product \(A D\) produces a matrix whose columns are linear combinations of the columns of \(A\) where the coefficients are taken from the corresponding column of \(D\). But the columns of \(D\) have at most one nonzero element, the diagonal one. So the columns of \(AD\) are in fact multiples of the columns of \(A\)

\[\begin{split} AD = \begin{pmatrix} | & & | \\ d_{11} \mathbf{a}_1 & \ldots & d_{rr} \mathbf{a}_r \\ | & & | \end{pmatrix} \end{split}\]

where \(\mathbf{a}_1,\ldots,\mathbf{a}_k\) are the columns of \(A\) and \(d_{11},\ldots,d_{rr}\) are the diagonal elements of \(D\).

Similarly, the rows of \(D B\) are linear combinations of the rows of \(B\) where the coefficients are taken from the corresponding row of \(D\). The rows of \(D\) have at most one nonzero element, the diagonal one. In the case \(k \geq r\), rows \(r+1,\ldots,k\) of \(D\) necessarily have only zero entries since there is no diagonal entry there. Hence the first \(r\) rows of \(DB\) are multiples of the rows of \(B\) and the next \(k-r\) rows are \(\mathbf{0}\)

\[\begin{split} DB = \begin{pmatrix} \horz & d_{11} \mathbf{b}_1^T & \horz \\ & \vdots &\\ \horz & d_{rr} \mathbf{b}_r^T & \horz\\ \horz & \mathbf{0} & \horz\\ & \vdots &\\ \horz & \mathbf{0} & \horz \end{pmatrix} \end{split}\]

where \(\mathbf{b}_1^T,\ldots,\mathbf{b}_r^T\) are the rows of \(B\).
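Here is a small numerical sketch of these two facts (with \(n = 2\), \(k = 3\), \(r = 2\), \(m = 3\); the entries are arbitrary).

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])           # n x k
D = np.array([[2., 0.],
              [0., 3.],
              [0., 0.]])               # k x r diagonal, with k >= r
B = np.array([[1., 1., 1.],
              [1., 2., 3.]])           # r x m
print(A @ D)    # columns are 2 a_1 and 3 a_2
print(D @ B)    # rows are 2 b_1^T, 3 b_2^T, and a zero row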

EXAMPLE: The following special case will be useful later in this chapter. Suppose \(D, F \in \mathbb{R}^{n \times n}\) are both square diagonal matrices. Then \(D F\) is the matrix whose diagonal elements are \(d_{11} f_{11}, \ldots,d_{nn} f_{nn}\). \(\lhd\)

Spectral theorem When \(A\) is symmetric, a remarkable result is that there exists an orthonormal basis of \(\mathbb{R}^d\) made of eigenvectors of \(A\). We will prove this result in a subsequent chapter.

KNOWLEDGE CHECK: Let \(A \in \mathbb{R}^{d \times d}\) be symmetric. Show that for any \(\mathbf{u}\) and \(\mathbf{v}\) in \(\mathbb{R}^d\) it holds that

\[ \langle \mathbf{u}, A \mathbf{v} \rangle = \langle A \mathbf{u}, \mathbf{v} \rangle. \]

\(\checkmark\)

Before stating the result formally, we make a few observations. Let \(A \in \mathbb{R}^{d \times d}\) be symmetric. Suppose that \(\mathbf{v}_i\) and \(\mathbf{v}_j\) are eigenvectors corresponding respectively to distinct eigenvalues \(\lambda_i\) and \(\lambda_j\). Then the following quantity can be written in two ways

\[ \langle \mathbf{v}_i, A \mathbf{v}_j \rangle = \langle \mathbf{v}_i, \lambda_j \mathbf{v}_j \rangle = \lambda_j \langle \mathbf{v}_i, \mathbf{v}_j \rangle \]

and, by symmetry of \(A\),

\[ \langle \mathbf{v}_i, A \mathbf{v}_j \rangle = \langle A \mathbf{v}_i, \mathbf{v}_j \rangle = \langle \lambda_i \mathbf{v}_i, \mathbf{v}_j \rangle = \lambda_i \langle \mathbf{v}_i, \mathbf{v}_j \rangle. \]

Subtracting the two,

\[ (\lambda_j - \lambda_i) \langle \mathbf{v}_i, \mathbf{v}_j \rangle = 0 \]

and using that \(\lambda_i \neq \lambda_j\)

\[ \langle \mathbf{v}_i, \mathbf{v}_j \rangle = 0. \]

That is, \(\mathbf{v}_i\) and \(\mathbf{v}_j\) are necessarily orthogonal.

We proved:

LEMMA Let \(A \in \mathbb{R}^{d \times d}\) be symmetric. Suppose that \(\mathbf{v}_i\) and \(\mathbf{v}_j\) are eigenvectors corresponding to distinct eigenvalues. Then \(\mathbf{v}_i\) and \(\mathbf{v}_j\) are orthogonal. \(\flat\)

This lemma gives a different proof – in the symmetric case – that the number of distinct eigenvalues is at most \(d\), since a list of nonzero, pairwise orthogonal vectors is linearly independent.

In fact:

THEOREM (Spectral) \(\idx{spectral theorem}\xdi\) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix, that is, \(A^T = A\). Then \(A\) has \(d\) orthonormal eigenvectors \(\mathbf{q}_1, \ldots, \mathbf{q}_d\) with corresponding (not necessarily distinct) real eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d\). Moreover, \(A\) can be written as the matrix factorization

\[ A = Q \Lambda Q^T = \sum_{i=1}^d \lambda_i \mathbf{q}_i \mathbf{q}_i^T \]

where \(Q\) has columns \(\mathbf{q}_1, \ldots, \mathbf{q}_d\) and \(\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)\). \(\sharp\)

We refer to this factorization as a spectral decomposition\(\idx{spectral decomposition}\xdi\) of \(A\).

Note that this decomposition indeed produces the eigenvectors of \(A\). For any \(j\), we have

\[ A \mathbf{q}_j = \sum_{i=1}^d \lambda_i \mathbf{q}_i \mathbf{q}_i^T \mathbf{q}_j = \lambda_j \mathbf{q}_j, \]

where we used that, by orthonormality, \(\mathbf{q}_i^T \mathbf{q}_j = 0\) if \(i \neq j\) and \(\mathbf{q}_i^T \mathbf{q}_j = 1\) if \(i = j\). The equation above says precisely that \(\mathbf{q}_j\) is an eigenvector of \(A\) with corresponding eigenvalue \(\lambda_j\). Since we have found \(d\) eigenvalues (possibly with repetition), we have found all of them.
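In Numpy, a spectral decomposition of a symmetric matrix can be computed with numpy.linalg.eigh, which returns the eigenvalues in increasing order. Here is a minimal sketch reusing the \(2 \times 2\) symmetric matrix from the Numerical Corner above.

A = np.array([[2.5, -0.5], [-0.5, 2.5]])
lam, Q = LA.eigh(A)                     # eigenvalues (ascending) and orthonormal eigenvectors
Lam = np.diag(lam)
print(np.allclose(Q @ Lam @ Q.T, A))    # A = Q Lambda Q^T
print(np.allclose(sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(2)), A))   # outer-product form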

Let \(\lambda_1, \lambda_2, \ldots, \lambda_d\) be the eigenvalues whose existence is guaranteed by the Spectral Theorem. We first argue that there is no other eigenvalue. Indeed, assume \(\mu \neq \lambda_1, \lambda_2, \ldots, \lambda_d\) is an eigenvalue with corresponding eigenvector \(\mathbf{p}\). We have seen that \(\mathbf{p}\) is orthogonal to the eigenvectors \(\mathbf{q}_1, \ldots, \mathbf{q}_d\). Since the latter list forms an orthonormal basis of \(\mathbb{R}^d\), this cannot be the case and we have a contradiction.

Some of the eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_d\) can be repeated however, that is, there can be \(i, j\) such that \(\lambda_{i} = \lambda_{j}\). For instance, if \(A = I_{d \times d}\) is the identity matrix, then the eigenvalues are \(\lambda_i = 1\) for all \(i \in [d]\).

For a fixed eigenvalue \(\lambda\) of \(A\), the eigenvectors with eigenvalue \(\lambda\) are the nonzero vectors \(\mathbf{v}\) satisfying

\[ A \mathbf{v} = \lambda \mathbf{v} = \lambda I_{d \times d} \mathbf{v} \]

or, rearranging,

\[ (A - \lambda I_{d \times d})\mathbf{v} = \mathbf{0}. \]

Put differently, together with \(\mathbf{0}\), they form the subspace \(\mathrm{null}(A - \lambda I_{d \times d})\).

The eigenvectors in the Spectral Theorem corresponding to the same eigenvalue \(\lambda\) can be replaced by any orthonormal basis of the subspace \(\mathrm{null}(A - \lambda I_{d \times d})\). But, beyond this freedom, we have the following characterization: letting \(\mu_1,\ldots,\mu_f\) be the distinct values among \(\lambda_1, \lambda_2, \ldots, \lambda_d\),

\[ \mathbb{R}^d = \mathrm{null}(A - \mu_1 I_{d \times d}) \oplus \mathrm{null}(A - \mu_2 I_{d \times d}) \oplus \cdots \oplus \mathrm{null}(A - \mu_f I_{d \times d}), \]

where we used the fact that for any \(\mathbf{u} \in \mathrm{null}(A - \mu_i I_{d \times d})\) and \(\mathbf{v} \in \mathrm{null}(A - \mu_j I_{d \times d})\) with \(i \neq j\), we have that \(\mathbf{u}\) is orthogonal to \(\mathbf{v}\).

We have shown in particular that the sequence of eigenvalues in the Spectral Theorem is unique (counting repeats).

Two matrices \(B, D \in \mathbb{R}^{d \times d}\) are similar\(\idx{similar}\xdi\) if there is an invertible matrix \(P\) such that \(B = P^{-1} D P\). It can be shown that \(B\) and \(D\) correspond to the same linear map, but expressed in different bases. When \(P = Q\) is an orthogonal matrix, the transformation simplifies to \(B = Q^T D Q\).

Hence, a different way to think about a spectral decomposition is that it expresses the fact that any symmetric matrix is similar to a diagonal matrix through an orthogonal transformation. The basis in which the corresponding linear map is represented by a diagonal matrix is the basis of eigenvectors.

EXAMPLE: (Eigendecomposition of \(2 \times 2\) symmetric matrix) The simplest non-trivial case is the \(2 \times 2\) symmetric matrix

\[\begin{split} A = \begin{pmatrix} a & b\\ b & d \end{pmatrix}. \end{split}\]

We derive a step-by-step recipe to compute its eigenvalues and eigenvectors.

As shown previously, an eigenvalue \(\lambda\) corresponds to a nontrivial null space \(\mathrm{null}(A - \lambda I_{2 \times 2})\), and a corresponding eigenvector \(\mathbf{v}\) solves

\[\begin{split} \mathbf{0} = (A - \lambda I_{2 \times 2})\mathbf{v} = \begin{pmatrix}a - \lambda & b\\ b & d - \lambda\end{pmatrix}\mathbf{v}. \end{split}\]

Put differently, the matrix \(\begin{pmatrix}a - \lambda & b\\ b & d - \lambda\end{pmatrix}\) has linearly dependent columns. We have seen that one way to check this is to compute the determinant which in the \(2 \times 2\) case is simply

\[\begin{split} \mathrm{det}\left[\begin{pmatrix}a - \lambda & b\\ b & d - \lambda\end{pmatrix}\right] = (a - \lambda ) (d - \lambda) - b^2. \end{split}\]

This is a polynomial of degree \(2\) in \(\lambda\) called the characteristic polynomial of the matrix \(A\).

The roots of the characteristic polynomial, that is, the solutions of

\[ 0 = (a - \lambda ) (d - \lambda) - b^2 = \lambda^2 - (a + d)\lambda + (ad - b^2), \]

are

\[ \lambda_{1} = \frac{(a + d) + \sqrt{(a + d)^2 - 4(ad - b^2)}}{2} \]

and

\[ \lambda_{2} = \frac{(a + d) - \sqrt{(a + d)^2 - 4(ad - b^2)}}{2}. \]

Expanding the expression in the square root

\[ (a + d)^2 - 4(ad - b^2) = a^2 + d^2 - 2ad + 4b^2 = (a - d)^2 + 4b^2, \]

we see that the square root is well-defined (i.e., produces a real value) for any \(a, b, d\).

It remains to find the corresponding eigenvectors \(\mathbf{v}_{1} = (v_{1,1}, v_{1,2})\) and \(\mathbf{v}_2 = (v_{2,1}, v_{2,2})\) by solving the \(2 \times 2\) systems of linear equations

\[\begin{split} \begin{pmatrix} a - \lambda_i & b \\ b & d - \lambda_i \end{pmatrix} \begin{pmatrix} v_{i,1} \\ v_{i,2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \end{split}\]

which is guaranteed to have a nonzero solution since the matrix on the left-hand side is singular. When \(\lambda_1 = \lambda_2\), one needs to find two linearly independent solutions.

Here is a numerical example. Consider the matrix

\[\begin{split} A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}. \end{split}\]

The characteristic polynomial equation is

\[ \lambda^2 - 6\lambda + 8 = 0. \]

The eigenvalues are

\[ \lambda_{1}, \lambda_{2} = \frac{6 \pm \sqrt{36 - 4(9-1)}}{2} = \frac{6 \pm \sqrt{4}}{2} = 4, 2. \]

We then solve for the eigenvectors. For \(\lambda_1 = 4\)

\[\begin{split} \begin{pmatrix} 3 - \lambda_1 & 1 \\ 1 & 3 - \lambda_1 \end{pmatrix} \begin{pmatrix} v_{1,1} \\ v_{1,2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \Leftrightarrow \begin{cases} - v_{1,1} + v_{1,2} = 0\\ v_{1,1} - v_{1,2} = 0 \end{cases} \end{split}\]

so, after normalizing, we take

\[\begin{split} \mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 \\ 1\end{pmatrix}. \end{split}\]

For \(\lambda_2 = 2\):

\[\begin{split} \begin{pmatrix} 3 - \lambda_2 & 1 \\ 1 & 3 - \lambda_2 \end{pmatrix} \begin{pmatrix} v_{2,1} \\ v_{2,2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \Leftrightarrow \begin{cases} v_{2,1} + v_{2,2} = 0\\ v_{2,1} + v_{2,2} = 0 \end{cases} \end{split}\]

so, after normalizing, we take

\[\begin{split} \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 \\ -1\end{pmatrix}. \end{split}\]

The fact that these are the eigenvectors can be checked by hand (try it!). \(\lhd\)
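The hand computation above can also be checked with numpy.linalg.eigh; note that eigh returns the eigenvalues in increasing order and the eigenvectors only up to sign.

A = np.array([[3., 1.], [1., 3.]])
lam, Q = LA.eigh(A)
print(lam)    # [2. 4.]
print(Q)      # columns proportional to (1, -1)/sqrt(2) and (1, 1)/sqrt(2)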

More generally, for any matrix \(A \in \mathbb{R}^{d \times d}\), the real roots of the characteristic polynomial \(\mathrm{det}(A - \lambda I_{d \times d})\) are exactly the eigenvalues of \(A\). We will derive a more efficient numerical approach to compute them in a subsequent section.

The case of positive semidefinite matrices The eigenvalues of a symmetric matrix – while real – may be negative. There is however an important special case where the eigenvalues are nonnegative: positive semidefinite matrices. \(\idx{positive semidefinite matrix}\xdi\)

THEOREM (Characterization of Positive Semidefiniteness) \(\idx{characterization of positive semidefiniteness}\xdi\) Let \(A \in \mathbb{R}^{d \times d}\) be a symmetric matrix and let \(A = Q \Lambda Q^T\) be a spectral decomposition of \(A\) with \(\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)\). Then \(A \succeq 0\) if and only if its eigenvalues \(\lambda_1, \ldots, \lambda_d\) are nonnegative. \(\sharp\)

Proof: Assume \(A \succeq 0\). Let \(\mathbf{q}_i\) be an eigenvector of \(A\) with corresponding eigenvalue \(\lambda_i\). Then

\[ \langle \mathbf{q}_i, A \mathbf{q}_i \rangle = \langle \mathbf{q}_i, \lambda_i \mathbf{q}_i \rangle = \lambda_i \]

which must be nonnegative by definition of a positive semidefinite matrix.

In the other direction, assume \(\lambda_1, \ldots, \lambda_d \geq 0\). Then, by the spectral decomposition in outer-product form

\[ \langle \mathbf{x}, A \mathbf{x} \rangle = \mathbf{x}^T \left(\sum_{i=1}^d \lambda_i \mathbf{q}_i \mathbf{q}_i^T\right) \mathbf{x} = \sum_{i=1}^d \lambda_i \mathbf{x}^T\mathbf{q}_i \mathbf{q}_i^T\mathbf{x} = \sum_{i=1}^d \lambda_i \langle \mathbf{q}_i, \mathbf{x}\rangle^2 \]

which is necessarily nonnegative. \(\square\)

Similarly, a symmetric matrix is positive definite\(\idx{positive definite matrix}\xdi\) if and only if all its eigenvalues are strictly positive. The proof is essentially the same.

KNOWLEDGE CHECK: Prove this last claim by modifying the proof above. \(\checkmark\)

Recall that an important application of positive semidefiniteness is as a characterization of convexity\(\idx{convex function}\xdi\). Here are some examples.

EXAMPLE: (Convexity via eigenvalues of Hessian) Consider the function

\[ f(x, y) = \frac{3}{2} x^2 + xy + \frac{3}{2} y^2 + 5x - 2y + 1. \]

To show it is convex, we compute its Hessian

\[\begin{split} H_f(x,y) = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix} \end{split}\]

for all \(x,y\). By a previous example, its eigenvalues are \(2\) and \(4\), both of which are strictly positive. That proves the claim by the Second-Order Convexity Condition. \(\lhd\)

EXAMPLE: (Log-concavity) A function \(f :\mathbb{R}^d \to \mathbb{R}\) is said to be log-concave if \(-\log f\) is convex. Put differently, we require for all \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^d\) and \(\alpha \in (0,1)\)

\[ - \log f((1-\alpha)\mathbf{x} + \alpha \mathbf{y}) \leq - (1-\alpha) \log f(\mathbf{x}) - \alpha \log f(\mathbf{y}). \]

This is equivalent to

\[ \log f((1-\alpha)\mathbf{x} + \alpha \mathbf{y}) \geq (1-\alpha) \log f(\mathbf{x}) + \alpha \log f(\mathbf{y}), \]

or, because \(a \log b = \log b^a\) and the logarithm is strictly increasing,

\[ f((1-\alpha)\mathbf{x} + \alpha \mathbf{y}) \geq f(\mathbf{x})^{1-\alpha} f(\mathbf{y})^{\alpha}. \]

We will see later in the course that a multivariate Gaussian vector \(\mathbf{X}\) on \(\mathbb{R}^d\) with mean \(\bmu \in \mathbb{R}^d\) and positive definite covariance matrix \(\bSigma \in \mathbb{R}^{d \times d}\) has probability density function (PDF)

\[ f_{\bmu, \bSigma}(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \,|\bSigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu)\right) \]

where \(|A|\) is the determinant of \(A\), which in the case of a symmetric matrix is simply the product of its eigenvalues. We claim that this PDF is log-concave.

From the definition,

\[\begin{align*} &- \log f_{\bmu, \bSigma}(\mathbf{x})\\ &= \frac{1}{2}(\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu) + \log (2\pi)^{d/2} \,|\bSigma|^{1/2}\\ &= \frac{1}{2}\mathbf{x}^T \bSigma^{-1} \mathbf{x} - \bmu^T \bSigma^{-1} \mathbf{x} + \left[\frac{1}{2}\bmu^T \bSigma^{-1} \bmu + \log (2\pi)^{d/2} \,|\bSigma|^{1/2}\right]. \end{align*}\]

Let \(P = \bSigma^{-1}\), \(\mathbf{q} = - \bSigma^{-1} \bmu\) (so that \(\mathbf{q}^T \mathbf{x} = - \bmu^T \bSigma^{-1} \mathbf{x}\) by the symmetry of \(\bSigma^{-1}\)) and \(r = \frac{1}{2}\bmu^T \bSigma^{-1} \bmu + \log (2\pi)^{d/2} \,|\bSigma|^{1/2}\).

A previous example then implies that the PDF is log-concave if \(\bSigma^{-1}\) is positive semidefinite. Since \(\bSigma\) is positive definite by assumption, \(\bSigma = Q \Lambda Q^T\) has a spectral decomposition where all diagonal entries of \(\Lambda\) are strictly positive. Then \(\bSigma^{-1} = Q \Lambda^{-1} Q^T\), where the diagonal entries of \(\Lambda^{-1}\) are the inverses of those of \(\Lambda\) – and hence strictly positive as well. In particular, \(\bSigma^{-1}\) is positive semidefinite. \(\lhd\)
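As a small numerical illustration of this last step, the eigenvalues of \(\bSigma^{-1}\) are the reciprocals of those of \(\bSigma\), hence strictly positive; the positive definite matrix Sigma below is an arbitrary choice.

Sigma = np.array([[2., 0.5], [0.5, 1.]])
lam, _ = LA.eigh(Sigma)
lam_inv, _ = LA.eigh(LA.inv(Sigma))
print(lam)                            # strictly positive
print(np.sort(1. / lam), lam_inv)     # same values (up to ordering and rounding)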

NUMERICAL CORNER: By the characterization above, we can check whether a matrix is positive semidefinite by computing its eigenvalues using numpy.linalg.eig and verifying that they are all nonnegative.

A = np.array([[1, -1], [-1, 1]])
w, v = LA.eig(A)
print(w)
[2. 0.]
B = np.array([[1, -2], [-2, 1]])
z, u = LA.eig(B)
print(z)
[ 3. -1.]

KNOWLEDGE CHECK: Which one(s) of these matrices is positive semidefinite?

\[\begin{split} A = \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix} \qquad B = \begin{pmatrix} 1 & -2\\ -2 & 1 \end{pmatrix} \end{split}\]

a) Both

b) \(A\)

c) \(B\)

d) Neither

\(\checkmark\)

\(\unlhd\)

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 What is the rank of a matrix \(A \in \mathbb{R}^{n \times m}\)?

a) The number of non-zero entries in \(A\).

b) The dimension of the row space of \(A\).

c) The dimension of the null space of \(A\).

d) The trace of \(A\).

2 Which of the following is true about the rank of a matrix \(A \in \mathbb{R}^{n \times m}\)?

a) \(\mathrm{rk}(A) \leq \min\{n,m\}\)

b) \(\mathrm{rk}(A) \geq \max\{n,m\}\)

c) \(\mathrm{rk}(A) = \mathrm{rk}(A^T)\) only if \(A\) is symmetric

d) \(\mathrm{rk}(A) = \mathrm{rk}(A^T)\) only if \(A\) is square

3 Let \(A \in \mathbb{R}^{n \times k}\) and \(B \in \mathbb{R}^{k \times m}\). Which of the following is true?

a) \(\mathrm{rk}(AB) \leq \mathrm{rk}(A)\)

b) \(\mathrm{rk}(AB) \geq \mathrm{rk}(A)\)

c) \(\mathrm{rk}(AB) = \mathrm{rk}(A)\)

d) \(\mathrm{rk}(AB) = \mathrm{rk}(B)\)

4 Let \(A \in \mathbb{R}^{d \times d}\) be symmetric. Which of the following is true according to the Spectral Theorem?

a) \(A\) has at most \(d\) distinct eigenvalues

b) \(A\) has exactly \(d\) distinct eigenvalues

c) \(A\) has at least \(d\) distinct eigenvalues

d) The number of distinct eigenvalues of \(A\) is unrelated to \(d\)

5 Which of the following is true about the outer product of two vectors \(u\) and \(v\)?

a) It is a scalar.

b) It is a vector.

c) It is a matrix of rank one.

d) It is a matrix of rank zero.

Answer for 1: b. Justification: The text states “the row rank and column rank of \(A\) [are] simply […] the rank, which we denote by \(\mathrm{rk}(A)\).”

Answer for 2: a. Justification: The text states in the Row Rank Equals Column Rank Theorem that “the row rank of \(A\) equals the column rank of \(A\). Moreover, \(\mathrm{rk}(A) \leq \min\{n,m\}\).”

Answer for 3: a. Justification: The text shows that “the columns of \(AB\) are linear combinations of the columns of \(A\). Hence \(\mathrm{col}(AB) \subseteq \mathrm{col}(A)\). The claim follows by Observation D2.”

Answer for 4: a. Justification: The Spectral Theorem states that “A symmetric matrix \(A\) has \(d\) orthonormal eigenvectors \(\mathbf{q}_1, \ldots, \mathbf{q}_d\) with corresponding (not necessarily distinct) real eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d\).”

Answer for 5: c. Justification: The text defines the outer product and states that “If \(u\) and \(v\) are nonzero, the matrix \(uv^T\) has rank one.”