\(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\)

3.4. Power iteration#

There is in general no exact method for computing SVDs. Instead we must rely on iterative methods, that is, methods that progressively approach the solution. We describe in this section the power iteration method. This method is behind an effective numerical approach for computing SVDs.

The focus here is on numerical methods, and we will not spend much time computing SVDs by hand. But note that the connection between the SVD and the spectral decomposition of \(A^T A\) can be used for this purpose on small examples.
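For instance, here is a minimal NumPy sketch (not part of the notes' code) of this connection on a small matrix: the eigenvalues of \(A^T A\) are squared singular values of \(A\), and its eigenvectors can serve as right singular vectors (up to sign).

# A small illustration (not from the notes) of computing an SVD via the
# spectral decomposition of A^T A.
import numpy as np

A = np.array([[1., 0.], [-1., 0.]])
w, Q = np.linalg.eigh(A.T @ A)   # eigenvalues of A^T A, in increasing order
sigma = np.sqrt(w[::-1])         # singular values: square roots, in decreasing order
V = Q[:, ::-1]                   # columns are right singular vectors (up to sign)
print(sigma)                     # [1.41421356, 0.]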

3.4.1. Key lemma#

We now derive the main idea behind an algorithm to compute singular vectors.

Eigenvectors in the positive semidefinite case. We start with eigenvectors. For our purposes, it will suffice to (1) restrict ourselves to positive semidefinite matrices and (2) focus on the largest eigenvalue.

Let \(A\) be a symmetric, positive semidefinite matrix in \(\mathbb{R}^{d \times d}\). By the Spectral Theorem, it has an eigenvector decomposition

\[ A = Q \Lambda Q^T = \sum_{i=1}^d \lambda_i \mathbf{q}_i \mathbf{q}_i^T \]

where further \(0 \leq \lambda_d \leq \cdots \leq \lambda_1\) by the Characterization of Positive Semidefiniteness. To see how one might go about computing the top eigenvector \(\mathbf{q}_1\), we start with the following observation.

Because of the orthogonality of \(Q\), the powers of \(A\) have a simple representation. The square gives

\[ A^2 = (Q \Lambda Q^T) (Q \Lambda Q^T) = Q \Lambda^2 Q^T. \]

Repeating, we obtain

\[ A^{k} = Q \Lambda^{k} Q^T. \]

Further

\[\begin{split} \Lambda^k = \begin{pmatrix} \lambda_1^{k} & 0 & \cdots & 0\\ 0 & \lambda_2^{k} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \lambda_d^{k} \end{pmatrix}. \end{split}\]

When \(\lambda_1 > \lambda_2 \geq \cdots \geq \lambda_d\), we have \(\lambda_1^{k} \gg \lambda_2^{k}, \ldots, \lambda_d^{k}\) for large \(k\). This gives the approximation

\[ A^{k} = \sum_{i=1}^d \lambda_i^{k} \mathbf{q}_i \mathbf{q}_i^T \approx \lambda_1^{k} \mathbf{q}_1 \mathbf{q}_1^T. \]

This leads to the following idea to compute \(\mathbf{q}_1\):

LEMMA (Power Iteration in the Positive Semidefinite Case) Let \(A\) be a symmetric, positive semidefinite matrix in \(\mathbb{R}^{d \times d}\) with eigenvector decomposition \(A= Q \Lambda Q^T\) where the eigenvalues satisfy \(0 \leq \lambda_d \leq \cdots \leq \lambda_2 < \lambda_1\). Assume that \(\mathbf{x} \in \mathbb{R}^d\) is a vector such that \(\langle \mathbf{q}_1, \mathbf{x} \rangle > 0\). Then

\[ \frac{A^{k} \mathbf{x}}{\|A^{k} \mathbf{x}\|} \to \mathbf{q}_1 \]

as \(k \to +\infty\). If instead \(\langle \mathbf{q}_1, \mathbf{x} \rangle < 0\), then the limit is \(- \mathbf{q}_1\). \(\flat\)

The proof is similar to the case of singular vectors, which is detailed next.
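To get a feel for the lemma, here is a quick numerical sketch (not part of the notes' code) on a \(2 \times 2\) positive semidefinite matrix. Normalizing at every step only rescales \(A^k \mathbf{x}\), so it does not change the limit, and it avoids overflow.

# A quick numerical check of the lemma (a sketch, not from the notes).
import numpy as np

A = np.array([[2., 1.], [1., 2.]])   # symmetric PSD with eigenvalues 3 and 1
x = np.array([1., 0.])               # satisfies <q_1, x> > 0 for q_1 = (1,1)/sqrt(2)
for _ in range(20):
    x = A @ x
    x = x / np.linalg.norm(x)        # rescaling does not affect the direction
print(x)                             # approximately [0.70710678 0.70710678], i.e. q_1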

SVD. Let \(U \Sigma V^T\) be a (compact) SVD of \(A\). To see how one might go about computing it, we start with the following observation, which relies on a different way to state the connection between the SVD and the spectral decomposition of \(A^T A\). Because of the orthogonality of \(U\) and \(V\), the powers of \(A^T A\) have a simple representation. Indeed

\[ B = A^T A = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T, \]

so that

\[ B^2 = (V \Sigma^T \Sigma V^T) (V \Sigma^T \Sigma V^T) = V (\Sigma^T \Sigma)^2 V^T. \]

Repeating

\[ B^{k} = V (\Sigma^T \Sigma)^{k} V^T. \]

Further,

\[\begin{split} \widetilde{\Sigma} = \Sigma^T \Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0\\ 0 & \sigma_2^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_r^2 \end{pmatrix}. \end{split}\]

So

\[\begin{split} \widetilde{\Sigma}^k = \begin{pmatrix} \sigma_1^{2k} & 0 & \cdots & 0\\ 0 & \sigma_2^{2k} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_r^{2k} \end{pmatrix}. \end{split}\]

When \(\sigma_1 > \sigma_2 \geq \cdots \geq \sigma_r\), which is often the case with real datasets, we have \(\sigma_1^{2k} \gg \sigma_2^{2k}, \ldots, \sigma_r^{2k}\) for large \(k\). This gives the approximation

\[ B^{k} = \sum_{j=1}^r \sigma_j^{2k} \mathbf{v}_j \mathbf{v}_j^T \approx \sigma_1^{2k} \mathbf{v}_1 \mathbf{v}_1^T. \]

Finally, we arrive at:

LEMMA (Power Iteration) Let \(A \in \mathbb{R}^{n\times m}\) be a matrix and let \(U \Sigma V^T\) be a (compact) SVD of \(A\) such that \(\sigma_1 > \sigma_2 > 0\). Define \(B = A^T A\) and assume that \(\mathbf{x} \in \mathbb{R}^m\) is a vector satisfying \(\langle \mathbf{v}_1, \mathbf{x} \rangle > 0\). Then

\[ \frac{B^{k} \mathbf{x}}{\|B^{k} \mathbf{x}\|} \to \mathbf{v}_1 \]

as \(k \to +\infty\). If instead \(\langle \mathbf{v}_1, \mathbf{x} \rangle < 0\), then the limit is \(- \mathbf{v}_1\). \(\flat\)

Proof idea: We use the approximation above and divide by the norm to get a unit norm vector in the direction of \(\mathbf{v}_1\).

Proof: We have

\[ B^{k}\mathbf{x} = \sum_{j=1}^r \sigma_j^{2k} \mathbf{v}_j \mathbf{v}_j^T \mathbf{x}. \]

So

\[\begin{align*} \frac{B^{k} \mathbf{x}}{\|B^{k} \mathbf{x}\|} &= \sum_{j=1}^r \mathbf{v}_j \frac{\sigma_j^{2k} (\mathbf{v}_j^T \mathbf{x})} {\|B^{k} \mathbf{x}\|}\\ &= \mathbf{v}_1 \left\{\frac{\sigma_1^{2k} (\mathbf{v}_1^T \mathbf{x})} {\|B^{k} \mathbf{x}\|}\right\} + \sum_{j=2}^r \mathbf{v}_j \left\{\frac{\sigma_j^{2k} (\mathbf{v}_j^T \mathbf{x})} {\|B^{k} \mathbf{x}\|}\right\}. \end{align*}\]

This goes to \(\mathbf{v}_1\) as \(k\to +\infty\) if the expression in the first curly brackets goes to \(1\) and the one in the second curly brackets goes to \(0\). We prove this in the next claim.

LEMMA As \(k\to +\infty\),

\[ \frac{\sigma_1^{2k} (\mathbf{v}_1^T \mathbf{x})} {\|B^{k} \mathbf{x}\|} \to 1 \qquad \text{and} \qquad \frac{\sigma_j^{2k} (\mathbf{v}_j^T \mathbf{x})} {\|B^{k} \mathbf{x}\|} \to 0, \ j = 2,\ldots,r. \]

\(\flat\)

Proof: Because the \(\mathbf{v}_j\)s are an orthonormal basis,

\[ \|B^{k}\mathbf{x}\|^2 = \sum_{j=1}^r \left[\sigma_j^{2k} (\mathbf{v}_j^T \mathbf{x})\right]^2 = \sum_{j=1}^r \sigma_j^{4k} (\mathbf{v}_j^T \mathbf{x})^2. \]

So, as \(k\to +\infty\), using the fact that \(\langle \mathbf{v}_1, \mathbf{x} \rangle \neq 0\) by assumption

\[\begin{align*} \frac{\|B^{k}\mathbf{x}\|^2}{\sigma_1^{4k} (\mathbf{v}_1^T \mathbf{x})^2} &= 1 + \sum_{j=2}^r \frac{\sigma_j^{4k} (\mathbf{v}_j^T \mathbf{x})^2}{\sigma_1^{4k} (\mathbf{v}_1^T \mathbf{x})^2}\\ &= 1 + \sum_{j=2}^r \left(\frac{\sigma_j}{\sigma_1}\right)^{4k} \frac{(\mathbf{v}_j^T \mathbf{x})^2}{(\mathbf{v}_1^T \mathbf{x})^2}\\ &\to 1, \end{align*}\]

since \(\sigma_j < \sigma_1\) for all \(j =2,\ldots,r\). That implies the first part of the claim by taking a square root and using \(\langle \mathbf{v}_1, \mathbf{x} \rangle > 0\). The second part of the claim follows from essentially the same argument. \(\square\)

This establishes the claim and completes the proof of the Power Iteration Lemma. \(\square\)

EXAMPLE: We revisit the example

\[\begin{split} A = \begin{pmatrix} 1 & 0\\ -1 & 0 \end{pmatrix}. \end{split}\]

We previously computed its SVD and found that

\[\begin{split} \mathbf{v}_1 = \begin{pmatrix} 1\\ 0 \end{pmatrix}. \end{split}\]

This time we use the Power Iteration Lemma. Here

\[\begin{split} B = A^T A = \begin{pmatrix} 2 & 0\\ 0 & 0 \end{pmatrix}. \end{split}\]

Taking powers of this matrix is easy

\[\begin{split} B^k = \begin{pmatrix} 2^k & 0\\ 0 & 0 \end{pmatrix}. \end{split}\]

Let’s choose an arbitrary initial vector \(\mathbf{x}\), say \((-1, 2)^T\). Then

\[\begin{split} B^k \mathbf{x} = \begin{pmatrix} -2^k\\ 0 \end{pmatrix} \quad \text{and} \quad \|B^k \mathbf{x}\| = 2^k. \end{split}\]

So

\[\begin{split} \frac{B^{k} \mathbf{x}}{\|B^{k} \mathbf{x}\|} \to \begin{pmatrix} -1\\ 0 \end{pmatrix} = - \mathbf{v}_1, \end{split}\]

as \(k \to +\infty\). In fact, in this case, convergence occurs after one step. \(\lhd\)
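As a quick numerical check of the preceding example (not part of the notes' code):

# Numerical check of the example above (a sketch, not from the notes).
import numpy as np

B = np.array([[2., 0.], [0., 0.]])
x = np.array([-1., 2.])
y = B @ x                        # a single application of B lands in the span of v_1
print(y / np.linalg.norm(y))     # [-1.  0.], i.e. -v_1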

3.4.2. Computing the top singular vector#

Power iteration gives us a way to compute \(\mathbf{v}_1\) – at least approximately if we use a large enough \(k\). But how do we find an appropriate vector \(\mathbf{x}\), as required by the Power Iteration Lemma? It turns out that a random vector will do. For instance, let \(\mathbf{X}\) be an \(m\)-dimensional spherical Gaussian with mean \(0\) and variance \(1\). Then, \(\mathbb{P}[\langle \mathbf{v}_1, \mathbf{X} \rangle = 0] = 0\). We will show this when we discuss multivariate Gaussians. Note that, if \(\langle \mathbf{v}_1, \mathbf{X} \rangle < 0\), we will instead converge to \(-\mathbf{v}_1\) which is also a right singular vector.

NUMERICAL CORNER: We implement the algorithm suggested by the Power Iteration Lemma. That is, we compute \(B^{k} \mathbf{x}\), then normalize it. To obtain the corresponding singular value and left singular vector, we use that \(\sigma_1 = \|A \mathbf{v}_1\|\) and \(\mathbf{u}_1 = A \mathbf{v}_1/\sigma_1\).

def topsing(A, maxiter=10):
    # rng, np and LA (numpy.linalg) are defined in the notes' setup
    x = rng.normal(0,1,np.shape(A)[1])   # random starting vector
    B = A.T @ A
    for _ in range(maxiter):             # compute B^k x
        x = B @ x
    v = x / LA.norm(x)                   # normalize to approximate v_1
    s = LA.norm(A @ v)                   # sigma_1 = ||A v_1||
    u = A @ v / s                        # u_1 = A v_1 / sigma_1
    return u, s, v

We will apply it to our previous two-cluster example. The necessary functions are in mmids.py, which is available in the GitHub repository of the notes.

d, n, w = 10, 100, 3.
X = mmids.two_mixed_clusters(d, n, w)
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], marker='x')
plt.show()
(Figure: scatter plot of the first two coordinates of the two-cluster dataset.)

Let’s compute the top singular vector.

u, s, v = topsing(X)
print(v)
[-0.99403346  0.06169445  0.04646685 -0.00370871  0.02909716 -0.02980242
 -0.02583767  0.04676195 -0.03515519 -0.00966285]

This is approximately \(-\mathbf{e}_1\). We get roughly the same answer (possibly up to sign) from Python’s numpy.linalg.svd function.

u, s, vh = LA.svd(X)
print(vh.T[:,0])
[ 0.99403346 -0.06169444 -0.04646684  0.00370871 -0.02909717  0.02980244
  0.02583767 -0.04676197  0.03515518  0.00966286]

Recall that, when we applied \(k\)-means clustering to this example in \(d=1000\) dimensions, we obtained a very poor clustering.

d, n, w = 1000, 100, 3.
X = mmids.two_mixed_clusters(d, n, w)
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], marker='x')
plt.show()
(Figure: scatter plot of the first two coordinates of the \(d=1000\) dataset.)
assign = mmids.kmeans(X, 2)
99641.677051717
99641.677051717
99641.677051717
99641.677051717
99641.677051717
99641.677051717
99641.677051717
99641.677051717
99641.677051717
99641.677051717
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()
(Figure: first two coordinates colored by the \(k\)-means assignment.)

Let’s try again, but after projecting on the top singular vector. Recall that this corresponds to finding the best one-dimensional approximating subspace. The projection can be computed using the truncated SVD \(Z= U_{(1)} \Sigma_{(1)} V_{(1)}^T\). We can interpret the rows of \(U_{(1)} \Sigma_{(1)}\) as the coefficients of each data point in the basis \(\mathbf{v}_1\). We will work in that basis. We need one small hack: because our implementation of \(k\)-means clustering expects data points in at least \(2\) dimensions, we add a column of \(0\)’s.

u, s, v = topsing(X)
Xproj = np.stack((u*s, np.zeros(np.shape(X)[0])), axis=-1)
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(Xproj[:,0], Xproj[:,1], alpha=0.25)
plt.ylim([-3,3])
plt.show()
(Figure: scatter plot of the data projected on the top singular vector.)

There is a small – yet noticeable – gap around 0.

A histogram of the first component of Xproj gives a better sense of the density of points.

plt.hist(Xproj[:,0])
plt.show()
(Figure: histogram of the first component of Xproj.)

We run \(k\)-means clustering on the projected data.

assign = mmids.kmeans(Xproj, 2)
1309.1441138296427
386.8878320256463
386.8878320256463
386.8878320256463
386.8878320256463
386.8878320256463
386.8878320256463
386.8878320256463
386.8878320256463
386.8878320256463
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()
(Figure: first two coordinates colored by the \(k\)-means assignment on the projected data.)

Much better. We give a more formal explanation of this outcome in an optional section. In essence, quoting [BHK, Section 7.5.1]:

[…] let’s understand the central advantage of doing the projection to [the top \(k\) right singular vectors]. It is simply that for any reasonable (unknown) clustering of data points, the projection brings data points closer to their cluster centers.

Finally, looking at the top right singular vector (or its first ten entries for lack of space), we see that it does align quite well (but not perfectly) with the first dimension.

print(v[:10])
[ 0.62787593  0.02260667 -0.0020277   0.01371159  0.01163581  0.01610368
  0.00903951 -0.01153729 -0.00906383 -0.00990353]

\(\unlhd\)

3.4.3. Computing more singular vectors#

We have shown how to compute the first right singular vector. How do we compute more singular vectors? One approach is to first compute \(\mathbf{v}_1\) (or \(-\mathbf{v}_1\)), then find a vector \(\mathbf{y}\) orthogonal to it and proceed as above, repeating until we have all \(m\) right singular vectors. A sketch of the first step of this approach is given below.
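Here is a minimal sketch of that idea (hypothetical code, not in mmids.py), written with the same conventions as topsing above (rng, np and LA from the notes' setup). It approximates \(\mathbf{v}_2\) by running power iteration on \(B = A^T A\) while keeping the iterate orthogonal to \(\mathbf{v}_1\); as in the Power Iteration Lemma, this assumes a gap between \(\sigma_2\) and the remaining singular values, and the output may be \(-\mathbf{v}_2\) depending on the random start.

# A sketch (not in mmids.py) of computing a second right singular vector:
# power iteration on B = A^T A, keeping the iterate orthogonal to v1.
def secondsing(A, v1, maxiter=100):
    B = A.T @ A
    y = rng.normal(0, 1, np.shape(A)[1])
    for _ in range(maxiter):
        y = y - (v1 @ y) * v1    # project out the v1 component (also counters round-off drift)
        y = B @ y
        y = y / LA.norm(y)       # normalize to avoid overflow
    return y

Projecting out all previously computed right singular vectors at every step in the same way yields further singular vectors one at a time.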

We are often interested only in the top, say \(\ell < m\), singular vectors. An alternative approach in that case is to start with \(\ell\) random vectors and first find an orthonormal basis for the space they span. Then, to quote [BHK, Section 3.7.1]:

Then compute \(B\) times each of the basis vectors, and find an orthonormal basis for the space spanned by the resulting vectors. Intuitively, one has applied \(B\) to a subspace rather than a single vector. One repeatedly applies \(B\) to the subspace, calculating an orthonormal basis after each application to prevent the subspace collapsing to the one dimensional subspace spanned by the first singular vector.

We will not prove here that this approach, known as orthogonal iteration, works. The proof is similar to that of the Power Iteration Lemma.

NUMERICAL CORNER: We implement this last algorithm. We will need our previous implementation of Gram-Schmidt.

def svd(A, l, maxiter=100):
    # start from l random vectors (columns of V); rng, np, LA, mmids from the notes' setup
    V = rng.normal(0,1,(np.size(A,1),l))
    for _ in range(maxiter):
        W = A @ V
        Z = A.T @ W                     # Z = (A^T A) V without forming A^T A
        V, R = mmids.gramschmidt(Z)     # re-orthonormalize the columns
    W = A @ V
    S = [LA.norm(W[:, i]) for i in range(np.size(W,1))]               # singular values
    U = np.stack([W[:,i]/S[i] for i in range(np.size(W,1))],axis=-1)  # left singular vectors
    return U, S, V

Note that above we avoided forming the matrix \(A^T A\). With a small number of iterations, that approach potentially requires fewer arithmetic operations overall, and it allows us to take advantage of the possible sparsity of \(A\) (i.e., the fact that it may have many zeros).
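As a quick sanity check (not part of the notes), we can compare the output of svd to NumPy's numpy.linalg.svd on a small random matrix; the singular values should agree up to numerical error, and the singular vectors up to sign.

# A quick check (a sketch, not from the notes) against NumPy's SVD.
A = rng.normal(0, 1, (20, 5))
U, S, V = svd(A, 2)
print(S)                 # top two singular values from orthogonal iteration
print(LA.svd(A)[1][:2])  # should agree up to numerical error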

We apply it again to our two-cluster example.

d, n, w = 1000, 100, 3.
X = mmids.two_mixed_clusters(d, n, w)

Let’s try again, but after projecting on the top two singular vectors. Recall that this corresponds to finding the best two-dimensional approximating subspace. The projection can be computed using the truncated SVD \(Z= U_{(2)} \Sigma_{(2)} V_{(2)}^T\). We can interpret the rows of \(U_{(2)} \Sigma_{(2)}\) as the coefficients of each data point in the basis \(\mathbf{v}_1,\mathbf{v}_2\). We will work in that basis.

U, S, V = svd(X, 2)
Xproj = np.stack((U[:,0]*S[0], U[:,1]*S[1]), axis=-1)
assign = mmids.kmeans(Xproj, 2)
3329.139169940858
2525.8080945260817
2513.7186850891817
2499.8370677155963
2498.548904664579
2498.548904664579
2498.548904664579
2498.548904664579
2498.548904664579
2498.548904664579
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()
(Figure: first two coordinates colored by the \(k\)-means assignment after projecting on the top two singular vectors.)

Finally, looking at the first two right singular vectors, we see that the first one does align quite well with the first dimension.

print(np.stack((V[:,0], V[:,1]), axis=-1))
[[ 0.65108828  0.04380735]
 [-0.00804941  0.032942  ]
 [-0.03645334 -0.00097783]
 ...
 [-0.00579195 -0.01518118]
 [ 0.01448518 -0.02433161]
 [-0.05940409 -0.01091726]]

\(\unlhd\)