More advanced material: low-rank approximation; why project; additional proofs

\(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\)

3.7. More advanced material: low-rank approximation; why project; additional proofs#

We cover various advanced topics.

3.7.1. Low-rank approximation#

Now that we have defined a notion of distance between matrices, we will consider the problem of finding a good approximation to a matrix \(A\) among all matrices of rank at most \(k\). We will start with the Frobenius norm, which is easier to work with, and we will show later on that the solution is the same under the induced norm.

The solution to this problem will be familiar. In essence, we will re-interpret our solution to the best approximating subspace as a low-rank approximation.

Low-rank approximation in the Frobenius norm

From the proof of the Row Rank Equals Column Rank Lemma, it follows that a rank-\(r\) matrix \(A\) can be written as a sum of \(r\) rank-\(1\) matrices

\[ A = \sum_{i=1}^r \mathbf{b}_i \mathbf{c}_i^T. \]

We will now consider the problem of finding a “simpler” approximation to \(A\)

\[ A \approx \sum_{i=1}^k \mathbf{b}'_i (\mathbf{c}'_i)^T \]

where \(k < r\). Here we measure the quality of this approximation using a matrix norm.

We are ready to state our key observation. In words, the best rank-\(k\) approximation to \(A\) in Frobenius norm is obtained by projecting the rows of \(A\) onto a linear subspace of dimension \(k\). We will come back to how one finds the best such subspace below. (Hint: We have already solved this problem.)

LEMMA (Projection and Rank-\(k\) Approximation) Let \(A \in \mathbb{R}^{n \times m}\). For any matrix \(B \in \mathbb{R}^{n \times m}\) of rank \(k \leq \min\{n,m\}\),

\[ \|A - B_{\perp}\|_F \leq \|A - B\|_F \]

where \(B_{\perp} \in \mathbb{R}^{n \times m}\) is the matrix of rank at most \(k\) obtained as follows. Denote row \(i\) of \(A\), \(B\) and \(B_{\perp}\) respectively by \(\boldsymbol{\alpha}_i^T\), \(\mathbf{b}_{i}^T\) and \(\mathbf{b}_{\perp,i}^T\), \(i=1,\ldots, n\). Set \(\mathbf{b}_{\perp,i}\) to be the orthogonal projection of \(\boldsymbol{\alpha}_i\) onto \(\mathcal{Z} = \mathrm{span}(\mathbf{b}_1,\ldots,\mathbf{b}_n)\). \(\flat\)

Proof idea: The square of the Frobenius norm decomposes as a sum of squared row norms. Each term in the sum is minimized by the orthogonal projection.

Proof: By definition of the Frobenius norm, we note that

\[ \|A - B\|_F^2 = \sum_{i=1}^n \sum_{j=1}^m (a_{i,j} - b_{i,j})^2 = \sum_{i=1}^n \|\boldsymbol{\alpha}_i - \mathbf{b}_{i}\|^2 \]

and similarly for \(\|A - B_{\perp}\|_F\). We make two observations:

  1. Because the orthogonal projection of \(\boldsymbol{\alpha}_i\) onto \(\mathcal{Z}\) minimizes the distance to \(\mathcal{Z}\), it follows that term by term \(\|\boldsymbol{\alpha}_i - \mathbf{b}_{\perp,i}\| \leq \|\boldsymbol{\alpha}_i - \mathbf{b}_{i}\|\) so that

\[ \|A - B_\perp\|_F^2 = \sum_{i=1}^n \|\boldsymbol{\alpha}_i - \mathbf{b}_{\perp,i}\|^2 \leq \sum_{i=1}^n \|\boldsymbol{\alpha}_i - \mathbf{b}_{i}\|^2 = \|A - B\|_F^2. \]
  2. Moreover, because the projections satisfy \(\mathbf{b}_{\perp,i} \in \mathcal{Z}\) for all \(i\), \(\mathrm{row}(B_\perp) \subseteq \mathrm{row}(B)\) and, hence, the rank of \(B_\perp\) is at most the rank of \(B\).

That concludes the proof. \(\square\)
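To see the lemma in action, here is a short numerical sketch (an illustration only; the dimensions, the rank \(k\) and the random matrices are arbitrary choices, and we assume NumPy is available). We build a random rank-\(k\) matrix \(B\), project the rows of \(A\) onto \(\mathrm{row}(B)\), and check the two conclusions of the lemma.

import numpy as np
rng = np.random.default_rng(0)
n, m, k = 20, 10, 3
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))  # a rank-k matrix
_, _, vhB = np.linalg.svd(B)
Q = vhB[:k, :].T                          # orthonormal basis of row(B), as columns
B_perp = A @ Q @ Q.T                      # project each row of A onto row(B)
print(np.linalg.norm(A - B_perp) <= np.linalg.norm(A - B))   # Frobenius error does not increase
print(np.linalg.matrix_rank(B_perp) <= k)                    # rank does not exceed k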

Recall the approximating subspace problem. That is, think of the rows \(\boldsymbol{\alpha}_i^T\) of \(A \in \mathbb{R}^{n \times m}\) as a collection of \(n\) data points in \(\mathbb{R}^m\). We are looking for a linear subspace \(\mathcal{Z}\) that minimizes

\[ \sum_{i=1}^n \|\boldsymbol{\alpha}_i - \mathrm{proj}_\mathcal{Z}(\boldsymbol{\alpha}_i)\|^2 \]

over all linear subspaces of \(\mathbb{R}^m\) of dimension at most \(k\). By the Projection and Rank-\(k\) Approximation Lemma, this problem is equivalent to finding a matrix \(B\) that minimizes

\[ \|A - B\|_F \]

among all matrices in \(\mathbb{R}^{n \times m}\) of rank at most \(k\). Of course we have solved this problem before.

Let \(A \in \mathbb{R}^{n \times m}\) be a matrix with SVD

\[ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T. \]

For \(k < r\), truncate the sum at the \(k\)-th term

\[ A_k = \sum_{j=1}^k \sigma_j \mathbf{u}_j \mathbf{v}_j^T. \]

The rank of \(A_k\) is exactly \(k\). Indeed, by construction,

  1. the vectors \(\{\mathbf{u}_j\,:\,j = 1,\ldots,k\}\) are orthonormal, and

  2. since \(\sigma_j > 0\) for \(j=1,\ldots,k\) and the vectors \(\{\mathbf{v}_j\,:\,j = 1,\ldots,k\}\) are orthonormal, \(\{\mathbf{u}_j\,:\,j = 1,\ldots,k\}\) spans the column space of \(A_k\).

We have shown before that \(A_k\) is the best approximation to \(A\) among matrices of rank at most \(k\) in Frobenius norm. Specifically, the Greedy Finds Best Fit Theorem implies that, for any matrix \(B \in \mathbb{R}^{n \times m}\) of rank at most \(k\),

\[ \|A - A_k\|_F \leq \|A - B\|_F. \]

This result is known as the Eckart-Young Theorem. It also holds in the induced \(2\)-norm, as we show next.
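Here is a quick numerical illustration of the Frobenius-norm statement (a sketch only; the matrix \(A\) and the rank-\(k\) competitors \(B\) are random, and we assume NumPy is available): the SVD truncation \(A_k\) is never beaten.

import numpy as np
rng = np.random.default_rng(0)
n, m, k = 20, 10, 3
A = rng.standard_normal((n, m))
u, s, vh = np.linalg.svd(A)
A_k = (u[:, :k] * s[:k]) @ vh[:k, :]      # truncated SVD: sum of the top k rank-1 terms
err_k = np.linalg.norm(A - A_k)           # ||A - A_k||_F
for _ in range(5):
    B = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))   # a random rank-k matrix
    print(err_k <= np.linalg.norm(A - B))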

Low-rank approximation in the induced norm

We show in this section that the same holds in the induced norm. First, some observations.

LEMMA (Matrix Norms and Singular Values: Truncation) Let \(A \in \mathbb{R}^{n \times m}\) be a matrix with SVD

\[ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T \]

where recall that \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\) and let \(A_k\) be the truncation defined above. Then

\[ \|A - A_k\|^2_F = \sum_{j=k+1}^r \sigma_j^2 \]

and

\[ \|A - A_k\|^2_2 = \sigma_{k+1}^2. \]

\(\flat\)

Proof: For the first claim, by definition, summing over the columns of \(A - A_k\)

\[ \|A - A_k\|^2_F = \left\|\sum_{j=k+1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T\right\|_F^2 = \sum_{i=1}^m \left\|\sum_{j=k+1}^r \sigma_j v_{j,i} \mathbf{u}_j \right\|^2. \]

Because the \(\mathbf{u}_j\)’s are orthonormal, this is

\[ \sum_{i=1}^m \sum_{j=k+1}^r \sigma_j^2 v_{j,i}^2 = \sum_{j=k+1}^r \sigma_j^2 \left(\sum_{i=1}^m v_{j,i}^2\right) = \sum_{j=k+1}^r \sigma_j^2 \]

where we used that the \(\mathbf{v}_j\)’s are also orthonormal.

For the second claim, recall that the induced norm is defined as

\[ \|B\|_2 = \max_{\mathbf{x} \in \mathbb{S}^{m-1}} \|B \mathbf{x}\|. \]

For any \(\mathbf{x} \in \mathbb{S}^{m-1}\)

\[ \left\|(A - A_k)\mathbf{x}\right\|^2 = \left\| \sum_{j=k+1}^r \sigma_j \mathbf{u}_j (\mathbf{v}_j^T \mathbf{x}) \right\|^2 = \sum_{j=k+1}^r \sigma_j^2 \langle \mathbf{v}_j, \mathbf{x}\rangle^2. \]

Because the \(\sigma_j\)’s are in decreasing order and \(\sum_{j=k+1}^r \langle \mathbf{v}_j, \mathbf{x}\rangle^2 \leq \|\mathbf{x}\|^2 = 1\), this is maximized by taking \(\langle \mathbf{v}_{k+1}, \mathbf{x}\rangle = 1\) and \(\langle \mathbf{v}_j, \mathbf{x}\rangle = 0\) for \(j > k+1\). That is, we take \(\mathbf{x} = \mathbf{v}_{k+1}\) and the maximum is then \(\sigma_{k+1}^2\), as claimed. \(\square\)
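Both formulas are easy to verify numerically (a sketch with a random matrix, assuming NumPy is available):

import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
k = 3
u, s, vh = np.linalg.svd(A)
A_k = (u[:, :k] * s[:k]) @ vh[:k, :]
print(np.allclose(np.linalg.norm(A - A_k)**2, np.sum(s[k:]**2)))   # Frobenius: sum of remaining sigma_j^2
print(np.allclose(np.linalg.norm(A - A_k, 2)**2, s[k]**2))         # induced 2-norm: sigma_{k+1}^2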

THEOREM (Low-Rank Approximation in the Induced Norm) Let \(A \in \mathbb{R}^{n \times m}\) be a matrix with SVD

\[ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T \]

and let \(A_k\) be the truncation defined above with \(k < r\). For any matrix \(B \in \mathbb{R}^{n \times m}\) of rank at most \(k\),

\[ \|A - A_k\|_2 \leq \|A - B\|_2. \]

\(\sharp\)

Proof idea: We know that \(\|A - A_k\|_2^2 = \sigma_{k+1}^2\). So we want to lower bound \(\|A - B\|_2^2\) by \(\sigma_{k+1}^2\). For that, we have to find an appropriate \(\mathbf{z}\) for any given \(B\) of rank at most \(k\). The idea is to take a vector \(\mathbf{z}\) in the intersection of the null space of \(B\) and the span of the singular vectors \(\mathbf{v}_1,\ldots,\mathbf{v}_{k+1}\). By the former, the squared norm of \((A - B)\mathbf{z}\) is equal to the squared norm of \(A\mathbf{z}\), which lower bounds \(\|A - B\|_2^2\). By the latter, \(\|A \mathbf{z}\|^2\) is at least \(\sigma_{k+1}^2\).

We previously established the following result.

THEOREM (Rank-Nullity) Let \(B \in \mathbb{R}^{n \times m}\) be a matrix of rank \(k\). Then \(\mathrm{dim}(\mathrm{null}(B)) = m - k\). Put differently,

\[ \mathrm{dim}(\mathrm{col}(B)) + \mathrm{dim}(\mathrm{null}(B)) = m. \]

\(\sharp\)

Proof: (Low-Rank Approximation in the Induced Norm) Since \(B\) has rank at most \(k\), the Rank-Nullity Theorem implies that the dimension of \(\mathrm{null}(B)\) is at least \(m-k\), so by Exercise 3.20 there is a unit vector \(\mathbf{z}\) in the intersection

\[ \mathbf{z} \in \mathrm{null}(B) \cap \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{k+1}). \]

Then \((A-B)\mathbf{z} = A\mathbf{z}\) since \(\mathbf{z} \in \mathrm{null}(B)\). Also since \(\mathbf{z} \in \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{k+1})\), and therefore orthogonal to \(\mathbf{v}_{k+2},\ldots,\mathbf{v}_r\), we have

\[\begin{align*} \|(A-B)\mathbf{z}\|^2 &= \|A\mathbf{z}\|^2\\ &= \left\|\sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T\mathbf{z}\right\|^2\\ &= \left\|\sum_{j=1}^{k+1} \sigma_j \mathbf{u}_j \mathbf{v}_j^T\mathbf{z}\right\|^2\\ &= \sum_{j=1}^{k+1} \sigma_j^2 \langle \mathbf{v}_j, \mathbf{z}\rangle^2\\ &\geq \sigma_{k+1}^2 \sum_{j=1}^{k+1} \langle \mathbf{v}_j, \mathbf{z}\rangle^2\\ &= \sigma_{k+1}^2. \end{align*}\]

Since \(\mathbf{z}\) is a unit vector, \(\|A - B\|_2^2 \geq \|(A-B)\mathbf{z}\|^2 \geq \sigma_{k+1}^2\). By the Matrix Norms and Singular Values Lemma, \(\sigma_{k+1}^2 = \|A - A_k\|_2^2\), and we are done. \(\square\)
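As a numerical sanity check of the theorem (an illustration only; the random rank-\(k\) competitors and dimensions are arbitrary, and NumPy is assumed), no random rank-\(k\) matrix \(B\) does better than \(A_k\) in the induced norm:

import numpy as np
rng = np.random.default_rng(0)
n, m, k = 20, 10, 3
A = rng.standard_normal((n, m))
u, s, vh = np.linalg.svd(A)
A_k = (u[:, :k] * s[:k]) @ vh[:k, :]
for _ in range(5):
    B = rng.standard_normal((n, k)) @ rng.standard_normal((k, m))   # a random rank-k matrix
    print(np.linalg.norm(A - A_k, 2) <= np.linalg.norm(A - B, 2))   # sigma_{k+1} <= ||A - B||_2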

3.7.2. Why project?#

We return to \(k\)-means clustering and why projecting to a lower-dimensional subspace can produce better results. We prove a simple inequality that provides some insight. Quoting [BHK, Section 7.5.1]:

[…] let’s understand the central advantage of doing the projection to [the top \(k\) right singular vectors]. It is simply that for any reasonable (unknown) clustering of data points, the projection brings data points closer to their cluster centers.

To elaborate, suppose we have \(n\) data points in \(d\) dimensions in the form of the rows \(\mathbf{a}_i^T\), \(i=1,\ldots, n\), of a matrix \(A \in \mathbb{R}^{n \times d}\), where we assume that \(n > d\) and that \(A\) has full column rank. Imagine these data points come from an unknown ground-truth \(k\)-clustering assignment \(g(i) \in [k]\), \(i = 1,\ldots, n\), with corresponding unknown centers \(\mathbf{c}_j \in \mathbb{R}^d\), \(j = 1,\ldots, k\). Let \(C \in \mathbb{R}^{n \times d}\) be the corresponding matrix, that is, row \(i\) of \(C\) is \(\mathbf{c}_j^T\) if \(g(i) = j\). The \(k\)-means objective of the true clustering is then

\[\begin{align} \sum_{j\in [k]} \sum_{i:g(i)=j} \|\mathbf{a}_i - \mathbf{c}_{j}\|^2 &= \sum_{j\in [k]} \sum_{i:g(i)=j} \sum_{\ell=1}^d (a_{i,\ell} - c_{j,\ell})^2\\ &= \sum_{i=1}^n \sum_{\ell=1}^d (a_{i,\ell} - c_{g(i),\ell})^2\\ &= \|A - C\|_F^2. \end{align}\]
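The identity above can be checked on made-up data (a sketch; the assignment \(g\), the centers and the data are random choices for illustration, and NumPy is assumed):

import numpy as np
rng = np.random.default_rng(0)
n, d, k = 12, 5, 3
A = rng.standard_normal((n, d))
g = rng.integers(k, size=n)                 # ground-truth assignment g(i)
centers = rng.standard_normal((k, d))       # ground-truth centers c_j
C = centers[g, :]                           # row i of C is c_{g(i)}^T
obj = sum(np.sum((A[i] - centers[g[i]])**2) for i in range(n))   # k-means objective of the true clustering
print(np.allclose(obj, np.linalg.norm(A - C)**2))                # equals ||A - C||_F^2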

The matrix \(A\) has an SVD

\[ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T \]

and for \(k < r\) we have the truncation

\[ A_k = \sum_{j=1}^k \sigma_j \mathbf{u}_j \mathbf{v}_j^T. \]

It corresponds to projecting each row of \(A\) onto the linear subspace spanned by the first \(k\) right singular vectors \(\mathbf{v}_1, \ldots, \mathbf{v}_k\). To see this, note that the \(i\)-th row of \(A\) is \(\boldsymbol{\alpha}_i^T = \sum_{j=1}^r \sigma_j u_{j,i} \mathbf{v}_j^T\) and that, because the \(\mathbf{v}_j\)’s are orthonormal and in particular \(\mathbf{v}_1,\ldots,\mathbf{v}_k\) is an orthonormal basis of its span, the projection of \(\boldsymbol{\alpha}_i\) onto \(\mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_k)\) is

\[ \sum_{\ell=1}^k \mathbf{v}_\ell \left\langle\sum_{j=1}^r \sigma_j u_{j,i} \mathbf{v}_j,\mathbf{v}_\ell\right\rangle = \sum_{\ell=1}^k \sigma_\ell u_{\ell,i} \mathbf{v}_\ell \]

which is the \(i\)-th row of \(A_k\). The \(k\)-means objective of \(A_k\) with respect to the ground-truth centers \(\mathbf{c}_j\), \(j=1,\ldots,k\), is \(\|A_k - C\|_F^2\).
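In matrix form, this says that \(A_k = A V_k V_k^T\), where the columns of \(V_k\) are \(\mathbf{v}_1,\ldots,\mathbf{v}_k\). A quick numerical check (a sketch with a random matrix, assuming NumPy is available):

import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
k = 3
u, s, vh = np.linalg.svd(A)
A_k = (u[:, :k] * s[:k]) @ vh[:k, :]        # SVD truncation
V_k = vh[:k, :].T                           # columns: the top k right singular vectors
print(np.allclose(A_k, A @ V_k @ V_k.T))    # rows of A projected onto span(v_1,...,v_k)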

One more observation: the rank of \(C\) is at most \(k\). Indeed, \(C\) has at most \(k\) distinct rows, so its row rank is \(k\) if those distinct rows are linearly independent and less than \(k\) otherwise.

THEOREM (Why Project) Let \(A \in \mathbb{R}^{n \times d}\) be a matrix and let \(A_k\) be the truncation above. For any matrix \(C \in \mathbb{R}^{n \times d}\) of rank \(\leq k\),

\[ \|A_k - C\|_F^2 \leq 8 k \|A - C\|_2^2. \]

\(\sharp\)

The content of this inequality is the following. The quantity \(\|A_k - C\|_F^2\) is the \(k\)-means objective of the projection \(A_k\) with respect to the true centers, that is, the sum of the squared distances to the centers. By the Matrix Norms and Singular Values Lemma, the inequality above gives that

\[ \|A_k - C\|_F^2 \leq 8 k \sigma_1(A - C)^2, \]

where \(\sigma_j(A - C)\) is the \(j\)-th singular value of \(A - C\). On the other hand, by the same lemma, the \(k\)-means objective of the un-projected data is

\[ \|A - C\|_F^2 = \sum_{j=1}^{\mathrm{rk}(A-C)} \sigma_j(A - C)^2. \]

If the rank of \(A-C\) is much larger than \(k\) and the singular values of \(A-C\) decay slowly, then the latter quantity may be much larger. In other words, projecting may bring the data points closer to their true centers, potentially making it easier to cluster them.

Proof: (Why Project) We have shown previously that, for any matrices \(A, B \in \mathbb{R}^{n \times m}\), the rank of their sum is less than or equal to the sum of their ranks, that is,

\[ \mathrm{rk}(A+B) \leq \mathrm{rk}(A) + \mathrm{rk}(B). \]

So the rank of the difference \(A_k - C\) is at most the sum of the ranks

\[ \mathrm{rk}(A_k - C) \leq \mathrm{rk}(A_k) + \mathrm{rk}(-C) \leq 2 k \]

where we used that the rank of \(A_k\) is \(k\) and the rank of \(C\) is \(\leq k\) since it has at most \(k\) distinct rows. By the Matrix Norms and Singular Values Lemma, the squared Frobenius norm of \(A_k - C\) is the sum of its at most \(2k\) nonzero squared singular values, each of which is at most \(\|A_k - C\|_2^2\). Hence

\[ \|A_k - C\|_F^2 \leq 2k \|A_k - C\|_2^2. \]

By the triangle inequality for matrix norms,

\[ \|A_k - C\|_2 \leq \|A_k - A\|_2 + \|A - C\|_2. \]

By the Low-Rank Approximation in the Induced Norm Theorem,

\[ \|A - A_k\|_2 \leq \|A - C\|_2 \]

since \(C\) has rank at most \(k\). Putting these three inequalities together,

\[ \|A_k - C\|_F^2 \leq 2k (2 \|A - C\|_2)^2 = 8k \|A - C\|_2^2. \]

\(\square\)

NUMERICAL CORNER: We return to our example with the two Gaussian clusters. We use again the function mmids.two_separated_clusters, which produces two separated clusters.

d, n, w = 1000, 100, 3.   # dimension, points per cluster, offset of the cluster centers
X1, X2 = mmids.two_separated_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)

In reality, we cannot compute the matrix norms of \(X-C\) and \(X_k-C\) as the true centers are not known. But, because this is simulated data, we happen to know the truth and we can check the validity of our results in this case. The centers are:

C1 = np.stack([np.concatenate(([-w], np.zeros(d-1))) for _ in range(n)])   # n copies of the center (-w, 0, ..., 0)
C2 = np.stack([np.concatenate(([w], np.zeros(d-1))) for _ in range(n)])    # n copies of the center (w, 0, ..., 0)
C = np.concatenate((C1, C2), axis=0)

We use the numpy.linalg.svd function to compute the norms from the formulas in the Matrix Norms and Singular Values Lemma. First, we observe that the singular values of \(X-C\) are decaying slowly.

uc, sc, vhc = LA.svd(X-C)
plt.plot(sc)
plt.show()
(Figure: plot of the singular values of \(X - C\), which decay slowly.)

The \(k\)-means objective with respect to the true centers under the full-dimensional data is:

frob = np.sum(sc**2)
print(frob)
207905.47916782406

while the square of the top singular value of \(X-C\) is only:

top_sval_sq = sc[0]**2
print(top_sval_sq)
8258.19604762502

Finally, we compute the \(k\)-means objective with respect to the true centers under the projected one-dimensional data:

u, s, vh = LA.svd(X)
frob_proj = np.sum((s[0] * np.outer(u[:,0], vh[0,:]) - C)**2)   # ||X_1 - C||_F^2, with X_1 the rank-one truncation of X
print(frob_proj)
8099.057045408984

\(\unlhd\)

3.7.3. Additional proofs#

In this section, we give several proofs that were left out in this chapter.

Proof of the Number of Eigenvalues Lemma

We start with a proof that distinct eigenvalues correspond to linearly independent eigenvectors.

Proof: (Number of Eigenvalues) Assume by contradiction that \(\mathbf{x}_1, \ldots, \mathbf{x}_m\) are linearly dependent. By the Linear Dependence Lemma, there is \(k \leq m\) such that

\[ \mathbf{x}_k \in \mathrm{span}(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}) \]

where \(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}\) are linearly independent. In particular, there are \(a_1, \ldots, a_{k-1}\) such that

\[ \mathbf{x}_k = a_1 \mathbf{x}_1 + \cdots + a_{k-1} \mathbf{x}_{k-1}. \]

Transform the equation above in two ways: (1) multiply both sides by \(\lambda_k\) and (2) apply \(A\). Then subtract the resulting equations. That leads to

\[ \mathbf{0} = a_1 (\lambda_k - \lambda_1) \mathbf{x}_1 + \cdots + a_{k-1} (\lambda_k - \lambda_{k-1}) \mathbf{x}_{k-1}. \]

Because the \(\lambda_i\)’s are distinct and \(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1}\) are linearly independent, we must have \(a_1 = \cdots = a_{k-1} = 0\). But that implies that \(\mathbf{x}_k = \mathbf{0}\), a contradiction.

For the second claim, if there were more than \(d\) distinct eigenvalues, then there would be more than \(d\) corresponding linearly independent eigenvectors by the first claim, a contradiction. \(\square\)
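The first claim can be illustrated numerically (a sketch, not part of the proof; we construct a matrix with prescribed distinct eigenvalues and check that the computed eigenvectors are linearly independent, assuming NumPy is available):

import numpy as np
rng = np.random.default_rng(0)
d = 5
D = np.diag(np.arange(1.0, d + 1))            # distinct eigenvalues 1, 2, ..., d
P = rng.standard_normal((d, d))
A = P @ D @ np.linalg.inv(P)                  # a matrix with those eigenvalues
eigvals, eigvecs = np.linalg.eig(A)
print(np.linalg.matrix_rank(eigvecs) == d)    # the d eigenvectors are linearly independent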

Proof of Greedy Finds Best Subspace

In this section, we prove the Greedy Finds Best Subspace Theorem without the use of the Spectral Theorem.

Proof idea: We proceed by induction. For an arbitrary orthonormal list \(\mathbf{w}_1,\ldots,\mathbf{w}_k\), we find an orthonormal basis of their span containing an element orthogonal to \(\mathbf{v}_1,\ldots,\mathbf{v}_{k-1}\). Then we use the definition of \(\mathbf{v}_k\) to conclude.

Proof: As we explained in the previous subsection, we reformulate the problem as a maximization

\[ \max \left\{ \sum_{j=1}^k \|A \mathbf{w}_j\|^2\ :\ \text{$\{\mathbf{w}_1,\ldots,\mathbf{w}_k\}$ is an orthonormal list} \right\}, \]

where we also replace the \(k\)-dimensional linear subspace \(\mathcal{Z}\) with an arbitrary orthonormal basis \(\{\mathbf{w}_1,\ldots,\mathbf{w}_k\}\).

We then proceed by induction. For \(k=1\), we define \(\mathbf{v}_1\) as a solution of the above maximization problem. Assume that, for any orthonormal list \(\{\mathbf{w}_1,\ldots,\mathbf{w}_\ell\}\) with \(\ell < k\), we have

\[ \sum_{j=1}^\ell \|A \mathbf{w}_j\|^2 \leq \sum_{j=1}^\ell \|A \mathbf{v}_j\|^2. \]

Now consider any orthonormal list \(\{\mathbf{w}_1,\ldots,\mathbf{w}_k\}\) and let its span be \(\mathcal{W} = \mathrm{span}(\mathbf{w}_1,\ldots,\mathbf{w}_k)\).

Step 1: For \(j=1,\ldots,k-1\), let \(\mathbf{v}_j'\) be the orthogonal projection of \(\mathbf{v}_j\) onto \(\mathcal{W}\) and let \(\mathcal{V}' = \mathrm{span}(\mathbf{v}'_1,\ldots,\mathbf{v}'_{k-1})\). Because \(\mathcal{V}' \subseteq \mathcal{W}\) has dimension at most \(k-1\) while \(\mathcal{W}\) itself has dimension \(k\), we can find an orthonormal basis \(\mathbf{w}'_1,\ldots,\mathbf{w}'_{k}\) of \(\mathcal{W}\) such that \(\mathbf{w}'_k\) is orthogonal to \(\mathcal{V}'\) (Why?). Then, for any \(j=1,\ldots,k-1\), we have the decomposition \(\mathbf{v}_j = \mathbf{v}'_j + (\mathbf{v}_j - \mathbf{v}'_j)\) where \(\mathbf{v}'_j \in \mathcal{V}'\) is orthogonal to \(\mathbf{w}'_k\) and \(\mathbf{v}_j - \mathbf{v}'_j\) is also orthogonal to \(\mathbf{w}'_k \in \mathcal{W}\) by the properties of the orthogonal projection onto \(\mathcal{W}\). Hence

\[\begin{align*} \left\langle \sum_{j=1}^{k-1}\beta_j \mathbf{v}_j, \mathbf{w}'_k \right\rangle &= \left\langle \sum_{j=1}^{k-1}\beta_j [\mathbf{v}'_j + (\mathbf{v}_j - \mathbf{v}'_j)], \mathbf{w}'_k \right\rangle\\ &= \left\langle \sum_{j=1}^{k-1}\beta_j \mathbf{v}'_j, \mathbf{w}'_k \right\rangle + \left\langle \sum_{j=1}^{k-1}\beta_j (\mathbf{v}_j - \mathbf{v}'_j), \mathbf{w}'_k \right\rangle\\ &= \sum_{j=1}^{k-1}\beta_j \left\langle \mathbf{v}'_j, \mathbf{w}'_k \right\rangle + \sum_{j=1}^{k-1}\beta_j \left\langle \mathbf{v}_j - \mathbf{v}'_j, \mathbf{w}'_k \right\rangle\\ &= 0 \end{align*}\]

for any \(\beta_j\)’s. That is, \(\mathbf{w}'_k\) is orthogonal to \(\mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{k-1})\).

Step 2: By the induction hypothesis, we have

\[ (*) \qquad \sum_{j=1}^{k-1} \|A \mathbf{w}'_j\|^2 \leq \sum_{j=1}^{k-1} \|A \mathbf{v}_j\|^2. \]

Moreover, recalling that the \(\boldsymbol{\alpha}_i^T\)’s are the rows of \(A\),

\[ (**) \qquad \sum_{j=1}^k \|A \mathbf{w}_j\|^2 = \sum_{i=1}^n \|\mathrm{proj}_{\mathcal{W}}(\boldsymbol{\alpha}_i)\|^2 = \sum_{j=1}^k \|A \mathbf{w}_j'\|^2 \]

since the \(\mathbf{w}_j\)’s and \(\mathbf{w}'_j\)’s form an orthonormal basis of the same subspace \(\mathcal{W}\). Since \(\mathbf{w}'_k\) is orthogonal to \(\mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_{k-1})\) by the conclusion of Step 1, by the definition of \(\mathbf{v}_k\) as a solution to

\[ \mathbf{v}_k\in \arg\max \{\|A \mathbf{v}\|:\|\mathbf{v}\| = 1, \ \langle \mathbf{v}, \mathbf{v}_j \rangle = 0, \forall j \leq k-1\}, \]

we have

\[ (*\!*\!*) \qquad \|A \mathbf{w}'_k\|^2 \leq \|A \mathbf{v}_k\|^2. \]

Step 3: Putting everything together

\[\begin{align*} \sum_{j=1}^k \|A \mathbf{w}_j\|^2 &= \sum_{j=1}^k \|A \mathbf{w}_j'\|^2 &\text{by $(**)$}\\ &= \sum_{j=1}^{k-1} \|A \mathbf{w}'_j\|^2 + \|A \mathbf{w}'_k\|^2\\ &\leq \sum_{j=1}^{k-1} \|A \mathbf{v}_j\|^2 + \|A \mathbf{v}_k\|^2 &\text{by $(*)$ and $(*\!*\!*)$}\\ &= \sum_{j=1}^{k} \|A \mathbf{v}_j\|^2\\ \end{align*}\]

which proves the claim. \(\square\)
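Numerically, the theorem says that no orthonormal list \(\mathbf{w}_1,\ldots,\mathbf{w}_k\) captures more of \(A\) than the top \(k\) right singular vectors. A short sketch (random \(A\), random orthonormal lists obtained via QR, NumPy assumed):

import numpy as np
rng = np.random.default_rng(0)
n, m, k = 20, 10, 3
A = rng.standard_normal((n, m))
_, s, vh = np.linalg.svd(A)
best = np.sum(np.linalg.norm(A @ vh[:k, :].T, axis=0)**2)   # sum_j ||A v_j||^2 = sigma_1^2 + ... + sigma_k^2
for _ in range(5):
    W, _ = np.linalg.qr(rng.standard_normal((m, k)))        # a random orthonormal list w_1,...,w_k
    print(np.sum(np.linalg.norm(A @ W, axis=0)**2) <= best + 1e-9)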

Proof of Existence of the SVD

We return to the proof of the Existence of SVD Theorem. We give an alternative proof that does not rely on the Spectral Theorem.

Proof: We have already done most of the work. The proof works as follows:

(1) Compute the greedy sequence \(\mathbf{v}_1,\ldots,\mathbf{v}_r\) from the Greedy Finds Best Subspace Theorem until the largest \(r\) such that

\[ \|A \mathbf{v}_r\|^2 > 0 \]

or, otherwise, until \(r=m\). The \(\mathbf{v}_j\)’s are orthonormal by construction.

(2) For \(j=1,\ldots,r\), let

\[ \sigma_j = \|A \mathbf{v}_j\| \quad\text{and}\quad \mathbf{u}_j = \frac{1}{\sigma_j} A \mathbf{v}_j. \]

Observe that, by our choice of \(r\), the \(\sigma_j\)’s are \(> 0\). They are also non-increasing: by definition of the greedy sequence,

\[ \sigma_i^2 = \max \{\|A \mathbf{w}_i\|^2 :\|\mathbf{w}_i\| = 1, \ \langle \mathbf{w}_i, \mathbf{v}_j \rangle = 0, \forall j \leq i-1\}, \]

where the set of orthogonality constraints gets larger as \(i\) increases. The \(\mathbf{u}_j\)’s have unit norm by definition.

We show below that they are also orthogonal.
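Steps (1)-(2) can be mimicked numerically: NumPy's SVD returns the \(\mathbf{v}_j\)'s (as the rows of vh) and the \(\sigma_j\)'s, and one can check that \(\mathbf{u}_j = A\mathbf{v}_j/\sigma_j\) recovers its left singular vectors and that these are indeed orthonormal (a sketch with a random matrix, not part of the proof):

import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
u, s, vh = np.linalg.svd(A, full_matrices=False)
U_check = (A @ vh.T) / s                                   # column j is A v_j / sigma_j
print(np.allclose(U_check, u))                             # matches the left singular vectors
print(np.allclose(U_check.T @ U_check, np.eye(len(s))))    # and they are orthonormal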

(3) Let \(\mathbf{z} \in \mathbb{R}^m\) be any vector. To show that our construction is correct, we prove that \(A \mathbf{z} = \left(U \Sigma V^T\right)\mathbf{z}\). Let \(\mathcal{V} = \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_r)\) and decompose \(\mathbf{z}\) into orthogonal components

\[ \mathbf{z} = \mathrm{proj}_{\mathcal{V}}(\mathbf{z}) + (\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z})) = \sum_{j=1}^r \langle \mathbf{z}, \mathbf{v}_j\rangle \,\mathbf{v}_j + (\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z})). \]

Applying \(A\) and using linearity, we get

\[\begin{align*} A \mathbf{z} &= \sum_{j=1}^r \langle \mathbf{z}, \mathbf{v}_j\rangle \, A\mathbf{v}_j + A (\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z})). \end{align*}\]

We claim that \(A (\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z})) = \mathbf{0}\). If \(\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z}) = \mathbf{0}\), that is certainly the case. If not, let

\[ \mathbf{w} = \frac{\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z})}{\|\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z})\|}. \]

By definition of \(r\) (note that \(r < m\) in this case, since otherwise \(\mathcal{V} = \mathbb{R}^m\) and \(\mathbf{z} - \mathrm{proj}_{\mathcal{V}}(\mathbf{z}) = \mathbf{0}\)),

\[ 0 = \max \{\|A \mathbf{w}_{r+1}\|^2 :\|\mathbf{w}_{r+1}\| = 1, \ \langle \mathbf{w}_{r+1}, \mathbf{v}_j \rangle = 0, \forall j \leq r\}. \]

Put differently, \(\|A \mathbf{w}_{r+1}\|^2 = 0\) (i.e. \(A \mathbf{w}_{r+1} = \mathbf{0}\)), for any unit vector \(\mathbf{w}_{r+1}\) orthogonal to \(\mathbf{v}_1,\ldots,\mathbf{v}_r\). That applies in particular to \(\mathbf{w}_{r+1} = \mathbf{w}\) by the Orthogonal Projection Theorem.

Hence, using the definition of \(\mathbf{u}_j\) and \(\sigma_j\), we get

\[\begin{align*} A \mathbf{z} &= \sum_{j=1}^r \langle \mathbf{z}, \mathbf{v}_j\rangle \, \sigma_j \mathbf{u}_j\\ &= \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T \mathbf{z}\\ &=\left(\sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T\right)\mathbf{z}\\ &= \left(U \Sigma V^T\right)\mathbf{z}. \end{align*}\]

That proves the existence of the SVD.

All that is left to prove is the orthogonality of the \(\mathbf{u}_j\)’s.

LEMMA (Left Singular Vectors are Orthogonal) For all \(1 \leq i \neq j \leq r\), \(\langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0\). \(\flat\)

Proof idea: Quoting [BHK, Section 3.6]:

Intuitively if \(\mathbf{u}_i\) and \(\mathbf{u}_j\), \(i < j\), were not orthogonal, one would suspect that the right singular vector \(\mathbf{v}_j\) had a component of \(\mathbf{v}_i\) which would contradict that \(\mathbf{v}_i\) and \(\mathbf{v}_j\) were orthogonal. Let \(i\) be the smallest integer such that \(\mathbf{u}_i\) is not orthogonal to all other \(\mathbf{u}_j\). Then to prove that \(\mathbf{u}_i\) and \(\mathbf{u}_j\) are orthogonal, we add a small component of \(\mathbf{v}_j\) to \(\mathbf{v}_i\), normalize the result to be a unit vector \(\mathbf{v}'_i \propto \mathbf{v}_i + \varepsilon \mathbf{v}_j\) and show that \(\|A \mathbf{v}'_i\| > \|A \mathbf{v}_i\|\), a contradiction.

Proof: We argue by contradiction. Let \(i\) be the smallest index such that there is a \(j > i\) with \(\langle \mathbf{u}_i, \mathbf{u}_j \rangle = \delta \neq 0\). Assume \(\delta > 0\) (otherwise use \(-\mathbf{u}_i\)). For \(\varepsilon \in (0,1)\), because the \(\mathbf{v}_k\)’s are orthonormal, \(\|\mathbf{v}_i + \varepsilon \mathbf{v}_j\|^2 = 1+\varepsilon^2\). Consider the vectors

\[ \mathbf{v}'_i = \frac{\mathbf{v}_i + \varepsilon \mathbf{v}_j}{\sqrt{1+\varepsilon^2}} \quad\text{and}\quad A \mathbf{v}'_i = \frac{\sigma_i \mathbf{u}_i + \varepsilon \sigma_j \mathbf{u}_j}{\sqrt{1+\varepsilon^2}}. \]

Observe that \(\mathbf{v}'_i\) is orthogonal to \(\mathbf{v}_1,\ldots,\mathbf{v}_{i-1}\), so that

\[ \|A \mathbf{v}'_i\| \leq \|A \mathbf{v}_i\| =\sigma_i. \]

On the other hand, by the Orthogonal Decomposition Lemma, we can write \(A \mathbf{v}_i'\) as a sum of its orthogonal projection on the unit vector \(\mathbf{u}_i\) and \(A \mathbf{v}_i' - \mathrm{proj}_{\mathbf{u}_i}(A \mathbf{v}_i')\), which is orthogonal to \(\mathbf{u}_i\). In particular, by Pythagoras, \(\|A \mathbf{v}_i'\| \geq \|\mathrm{proj}_{\mathbf{u}_i}(A \mathbf{v}_i')\|\). But that implies, for \(\varepsilon \in (0,1)\),

\[ \|A \mathbf{v}_i'\| \geq \|\mathrm{proj}_{\mathbf{u}_i}(A \mathbf{v}_i')\| = \left\langle \mathbf{u}_i, A \mathbf{v}_i'\right\rangle = \frac{\sigma_i + \varepsilon \sigma_j \delta}{\sqrt{1+\varepsilon^2}} \geq (\sigma_i + \varepsilon \sigma_j \delta) \,(1-\varepsilon^2/2) \]

where the second inequality follows from a Taylor expansion or the observation

\[ (1+\varepsilon^2)\,(1-\varepsilon^2/2)^2 = (1+\varepsilon^2)\,(1-\varepsilon^2 + \varepsilon^4/4) = 1 - \tfrac{3}{4}\varepsilon^4 + \tfrac{1}{4}\varepsilon^6 \leq 1. \]

Now note that

\[\begin{align*} \|A \mathbf{v}_i'\| &\geq (\sigma_i + \varepsilon \sigma_j \delta) \,(1-\varepsilon^2/2)\\ &= \sigma_i + \varepsilon \sigma_j \delta - \varepsilon^2\sigma_i/2 - \varepsilon^3 \sigma_j \delta/2\\ &= \sigma_i + \varepsilon \left( \sigma_j \delta - \varepsilon\sigma_i/2 - \varepsilon^2 \sigma_j \delta/2\right)\\ &> \sigma_i \end{align*}\]

for \(\varepsilon\) small enough, contradicting the inequality above. \(\square\)

SVD and Approximating Subspace

In constructing the SVD of \(A\), we used the greedy sequence for the best approximating subspace. Vice versa, given an SVD of \(A\), we can read off the solution to the approximating subspace problem. In other words, there was nothing special about the specific construction we used to prove existence of the SVD. While a matrix may have many SVDs, they all give a solution to the approximating subspace problems.

Further, this perspective gives what is known as a variational characterization of the singular values. We will have more to say about variational characterizations and their uses in the next chapter.

SVD and greedy sequence

Indeed, let

\[ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T \]

be an SVD of \(A\) with

\[ \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0. \]

We show that the \(\mathbf{v}_j\)’s form a greedy sequence for the approximating subspace problem. Complete \(\mathbf{v}_1,\ldots,\mathbf{v}_r\) into an orthonormal basis of \(\mathbb{R}^m\) by adding appropriate vectors \(\mathbf{v}_{r+1},\ldots,\mathbf{v}_m\). By construction, \(A\mathbf{v}_{i} = \mathbf{0}\) for all \(i=r+1,\ldots,m\).

We start with the case \(j=1\). For any unit vector \(\mathbf{w} \in \mathbb{R}^m\), we expand it as \(\mathbf{w} = \sum_{i=1}^m \langle \mathbf{w}, \mathbf{v}_i\rangle \,\mathbf{v}_i\). By the Properties of Orthonormal Lists,

\[\begin{align*} \|A \mathbf{w}\|^2 &= \left\| A \left( \sum_{i=1}^m \langle \mathbf{w}, \mathbf{v}_i\rangle \,\mathbf{v}_i \right) \right\|^2\\ &= \left\| \sum_{i=1}^m \langle \mathbf{w}, \mathbf{v}_i\rangle \,A \mathbf{v}_i \right\|^2\\ &= \left\| \sum_{i=1}^r \langle \mathbf{w}, \mathbf{v}_i\rangle \,\sigma_i \mathbf{u}_i \right\|^2\\ &= \sum_{i=1}^r \sigma_i^2 \langle \mathbf{w}, \mathbf{v}_i\rangle^2, \end{align*}\]

where we used the orthonormality of the \(\mathbf{u}_i\)’s and the fact that \(\mathbf{v}_{r+1},\ldots,\mathbf{v}_m\) are in the null space of \(A\).

By the Properties of Orthonormal Lists,

\[ \sum_{i=1}^m \langle \mathbf{w}, \mathbf{v}_i\rangle^2 = \left\|\sum_{i=1}^m \langle \mathbf{w}, \mathbf{v}_i\rangle \mathbf{v}_i \right\|^2 = \|\mathbf{w}\|^2 = 1. \]

Hence, because the \(\sigma_i\)’s are non-increasing, the sum

\[ \|A \mathbf{w}\|^2 = \sum_{i=1}^r \sigma_i^2 \langle \mathbf{w}, \mathbf{v}_i\rangle^2 \leq \sigma_1^2. \]

This upper bound is achieved by taking \(\mathbf{w} = \mathbf{v}_1\). So we have shown that \(\mathbf{v}_1 \in \arg\max\{\|A \mathbf{w}\|^2:\|\mathbf{w}\| = 1\}\).

More generally, for any unit vector \(\mathbf{w} \in \mathbb{R}^m\) that is orthogonal to \(\mathbf{v}_1,\ldots,\mathbf{v}_{j-1}\), we expand it as \(\mathbf{w} = \sum_{i=j}^m \langle \mathbf{w}, \mathbf{v}_i\rangle \,\mathbf{v}_i\). Then, as long as \(j\leq r\),

\[\begin{align*} \|A \mathbf{w}\|^2 &= \left\| \sum_{i=j}^m \langle \mathbf{w}, \mathbf{v}_i\rangle \,A \mathbf{v}_i \right\|^2\\ &= \left\| \sum_{i=j}^r \langle \mathbf{w}, \mathbf{v}_i\rangle \,\sigma_i \mathbf{u}_i \right\|^2\\ &= \sum_{i=j}^r \sigma_i^2 \langle \mathbf{w}, \mathbf{v}_i\rangle^2\\ &\leq \sigma_j^2. \end{align*}\]

where again we used that the \(\sigma_i\)’s are non-increasing and \(\sum_{i=1}^m \langle \mathbf{w}, \mathbf{v}_i\rangle^2 = 1\). This last bound is achieved by taking \(\mathbf{w} = \mathbf{v}_j\).

So we have shown the following.

THEOREM (Variational Characterization of Singular Values) Let \(A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T\) be an SVD of \(A\) with \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\). Then

\[ \sigma_j^2 = \max \{\|A \mathbf{w}\|^2:\|\mathbf{w}\| = 1, \ \langle \mathbf{w}, \mathbf{v}_i \rangle = 0, \forall i \leq j-1\}, \]

and

\[ \mathbf{v}_j\in \arg\max \{\|A \mathbf{w}\|^2:\|\mathbf{w}\| = 1, \ \langle \mathbf{w}, \mathbf{v}_i \rangle = 0, \forall i \leq j-1\}. \]

\(\sharp\)
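A numerical illustration of the variational characterization (a sketch with a random matrix and random unit vectors, assuming NumPy is available):

import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
_, s, vh = np.linalg.svd(A)
w = rng.standard_normal(10)
w /= np.linalg.norm(w)                                          # a random unit vector
print(np.linalg.norm(A @ w)**2 <= s[0]**2)                      # sigma_1^2 is the maximum
print(np.allclose(np.linalg.norm(A @ vh[0, :])**2, s[0]**2))    # attained at w = v_1
w = w - (w @ vh[0, :]) * vh[0, :]
w /= np.linalg.norm(w)                                          # make w orthogonal to v_1
print(np.linalg.norm(A @ w)**2 <= s[1]**2)                      # sigma_2^2 is the constrained maximum
print(np.allclose(np.linalg.norm(A @ vh[1, :])**2, s[1]**2))    # attained at w = v_2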