
3.5. Application to dimensionality reduction#

We discuss an application to principal components analysis and revisit our genetic dataset from earlier in the chapter.

3.5.1. Principal components analysis#

Principal components analysis (PCA) is a commonly used dimensionality reduction approach that is closely related to what we described in the previous sections. We formalize the connection.

The data matrix: In PCA we are given \(n\) data points \(\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^p\) with \(p\) features (i.e., coordinates). We denote the components of \(\mathbf{x}_i\) as \((x_{i1},\ldots,x_{ip})^T\). As usual, we stack them up into a matrix \(X\) whose \(i\)-th row is \(\mathbf{x}_i^T\).

The first step of PCA is to center the data, i.e., we assume that

\[ \frac{1}{n} \sum_{i=1}^n x_{ij} = 0, \qquad \forall j=1,\ldots,p \]

Put differently, the empirical mean of each column is \(0\). Quoting Wikipedia (and this will become clearer below):

Mean subtraction (a.k.a. “mean centering”) is necessary for performing classical PCA to ensure that the first principal component describes the direction of maximum variance. If mean subtraction is not performed, the first principal component might instead correspond more or less to the mean of the data. A mean of zero is needed for finding a basis that minimizes the mean square error of the approximation of the data.

An optional step is to divide each column by the square root of its sample variance, i.e., assume that

\[ \frac{1}{n-1} \sum_{i=1}^n x_{ij}^2 = 1, \qquad \forall j=1,\ldots,p. \]

As we mentioned in a previous chapter, this is particularly important when the features are measured in different units to ensure that their variability can be meaningfully compared.
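For concreteness, here is a minimal sketch of these two preprocessing steps in NumPy on a toy data matrix (the toy data, random seed, and variable names are purely illustrative).

import numpy as np

rng = np.random.default_rng(535)
X = rng.normal(size=(100, 5))                    # toy data: n=100 points, p=5 features

Xc = X - X.mean(axis=0)                          # mean centering: subtract each column's mean
Xs = Xc / Xc.std(axis=0, ddof=1)                 # optional: divide by each column's sample standard deviation

print(np.allclose(Xc.mean(axis=0), 0.))          # each column now has mean 0
print(np.allclose(Xs.var(axis=0, ddof=1), 1.))   # each column now has sample variance 1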

The first principal component: The first principal component is the linear combination of the features

\[ t_{i1} = \phi_{11} x_{i1} + \cdots + \phi_{p1} x_{ip} \]

with largest sample variance. For this to make sense, we need to constrain the \(\phi_{j1}\)’s. Specifically, we require

\[ \sum_{j=1}^p \phi_{j1}^2 = 1. \]

The \(\phi_{j1}\)’s are referred to as the loadings and the \(t_{i1}\)’s are referred to as the scores.

Formally, we seek to solve

\[ \max\left\{\frac{1}{n-1} \sum_{i=1}^n \left(\sum_{j=1}^p \phi_{j1} x_{ij}\right)^2\ :\ \sum_{j=1}^p \phi_{j1}^2 = 1\right\}, \]

where we used the fact that the \(t_{i1}\)’s are centered

\[\begin{align*} \frac{1}{n} \sum_{i=1}^n t_{i1} &= \frac{1}{n} \sum_{i=1}^n [\phi_{11} x_{i1} + \cdots + \phi_{p1} x_{ip}]\\ &= \phi_{11} \frac{1}{n} \sum_{i=1}^n x_{i1} + \cdots + \phi_{p1} \frac{1}{n} \sum_{i=1}^n x_{ip}\\ &= 0, \end{align*}\]

to compute their sample variance as the (rescaled) sum of their squares

\[ \frac{1}{n-1}\sum_{i=1}^n t_{i1}^2 = \frac{1}{n-1} \sum_{i=1}^n \left(\sum_{j=1}^p \phi_{j1} x_{ij}\right)^2. \]

Let \(\boldsymbol{\phi}_1 = (\phi_{11},\ldots,\phi_{p1})^T\) and \(\mathbf{t}_1 = (t_{11},\ldots,t_{n1})^T\). Then for all \(i\)

\[ t_{i1} = \mathbf{x}_i^T \boldsymbol{\phi}_1, \]

or in vector form

\[ \mathbf{t}_1 = X \boldsymbol{\phi}_1. \]

Also

\[ \frac{1}{n-1}\sum_{i=1}^n t_{i1}^2 = \frac{1}{n-1} \|\mathbf{t}_1\|^2 = \frac{1}{n-1} \|X \boldsymbol{\phi}_1\|^2. \]

Rewriting the maximization problem above in matrix form,

\[ \max\left\{\frac{1}{n-1} \|X \boldsymbol{\phi}_1\|^2 : \|\boldsymbol{\phi}_1\|^2 = 1\right\}, \]

we see that we have already encountered this problem (up to the factor of \(1/(n-1)\) which does not affect the solution). The solution is to take \(\boldsymbol{\phi}_1\) to be the top right singular vector of \(X\).
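As a quick numerical sanity check (a sketch on a toy centered matrix, using numpy.linalg.svd rather than the mmids routines), the variance of the scores along the top right singular vector is \(\sigma_1^2/(n-1)\), and it is at least the variance of the scores along any other unit-norm direction.

import numpy as np

rng = np.random.default_rng(535)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                               # center the columns

U, S, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0, :]                                      # top right singular vector
t1 = X @ phi1                                        # scores of the first principal component

w = rng.normal(size=5)
w = w / np.linalg.norm(w)                            # a random unit-norm direction
print(np.isclose(np.var(t1, ddof=1), S[0]**2 / (X.shape[0] - 1)))  # variance equals sigma_1^2/(n-1)
print(np.var(t1, ddof=1) >= np.var(X @ w, ddof=1))                 # expected: True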

The second principal component: The second principal component is the linear combination of the features

\[ t_{i2} = \phi_{12} x_{i1} + \cdots + \phi_{p2} x_{ip} \]

with largest sample variance that is also uncorrelated with the first principal component, in the sense that

\[ \frac{1}{n-1} \sum_{i=1}^n t_{i1} t_{i2} = 0. \]

The lemma below shows how to handle this condition. As before, we also require

\[ \sum_{j=1}^p \phi_{j2}^2 = 1. \]

As before, let \(\boldsymbol{\phi}_2 = (\phi_{12},\ldots,\phi_{p2})^T\) and \(\mathbf{t}_2 = (t_{12},\ldots,t_{n2})^T\).

LEMMA (Uncorrelated Principal Components) Assume \(X \neq \mathbf{0}\). Let \(t_{i1}\), \(t_{i2}\), \(\boldsymbol{\phi}_1\), \(\boldsymbol{\phi}_2\) be as above. Then

\[ \frac{1}{n-1} \sum_{i=1}^n t_{i1} t_{i2} = 0 \]

holds if and only if

\[ \langle \boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \rangle = 0. \]

\(\flat\)

Proof: The condition

\[ \frac{1}{n-1} \sum_{i=1}^n t_{i1} t_{i2} = 0 \]

is equivalent to

\[ \langle \mathbf{t}_1, \mathbf{t}_2 \rangle = 0, \]

where we dropped the \(1/(n-1)\) factor as it does not play any role. Using that \(\mathbf{t}_1 = X \boldsymbol{\phi}_1\), and similarly, \(\mathbf{t}_2 = X \boldsymbol{\phi}_2\), this is in turn equivalent to

\[ \langle X \boldsymbol{\phi}_1, X \boldsymbol{\phi}_2 \rangle = 0. \]

Because \(\boldsymbol{\phi}_1\) can be chosen as a top right singular vector in an SVD of \(X\), it follows from the SVD Relations that \(X^T X \boldsymbol{\phi}_1 = \sigma_1^2 \boldsymbol{\phi}_1\), where \(\sigma_1\) is the singular value associated with \(\boldsymbol{\phi}_1\). Since \(X \neq 0\), \(\sigma_1 > 0\). Plugging this into the inner product on the left-hand side above, we get

\[\begin{align*} \langle X \boldsymbol{\phi}_1, X \boldsymbol{\phi}_2 \rangle &= \langle X \boldsymbol{\phi}_2, X \boldsymbol{\phi}_1 \rangle\\ &= (X \boldsymbol{\phi}_2)^T (X \boldsymbol{\phi}_1)\\ &= \boldsymbol{\phi}_2^T X^T X \boldsymbol{\phi}_1\\ &= \boldsymbol{\phi}_2^T (\sigma_1^2 \boldsymbol{\phi}_1)\\ &= \langle \boldsymbol{\phi}_2, \sigma_1^2 \boldsymbol{\phi}_1 \rangle\\ &= \sigma_1^2 \langle \boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \rangle. \end{align*}\]

Because \(\sigma_1 \neq 0\), this is \(0\) if and only if \(\langle \boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \rangle = 0\). \(\square\)

As a result, we can write the maximization problem for the second principal component in matrix form as

\[ \max\left\{\frac{1}{n-1} \|X \boldsymbol{\phi}_2\|^2 : \|\boldsymbol{\phi}_2\|^2 = 1, \langle \boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \rangle = 0\right\}. \]

Again, we see that we have encountered this problem before. The solution is to take \(\boldsymbol{\phi}_2\) to be a second right singular vector in an SVD of \(X\).
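A minimal numerical sketch of this fact (again on a toy centered matrix, with numpy.linalg.svd standing in for the mmids routines): the scores along the first two right singular vectors have zero inner product, hence are uncorrelated.

import numpy as np

rng = np.random.default_rng(535)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
phi1, phi2 = Vt[0, :], Vt[1, :]                 # first two right singular vectors
t1, t2 = X @ phi1, X @ phi2                     # scores of the first two principal components

print(np.isclose(np.dot(phi1, phi2), 0.))       # loadings are orthogonal
print(np.isclose(np.dot(t1, t2), 0.))           # scores are uncorrelated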

Further principal components: We can proceed in a similar fashion and define further principal components.

To quote Wikipedia:

PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information (see below). PCA is often used in this manner for dimensionality reduction. PCA has the distinction of being the optimal orthogonal transformation for keeping the subspace that has largest “variance” (as defined above).

Formally, let

\[ X = U \Sigma V^T \]

be the SVD of the data matrix \(X\). The principal component transformation, truncated at the \(\ell\)-th component, is

\[ T = X V_{(\ell)}, \]

where \(T\) is the matrix whose columns are the vectors \(\mathbf{t}_1,\ldots,\mathbf{t}_\ell\). Recall that \(V_{(\ell)}\) is the matrix made of the first \(\ell\) columns of \(V\).

Then, using the orthonormality of the right singular vectors,

\[\begin{split} T = U \Sigma V^T V_{(\ell)} = U \Sigma \begin{bmatrix} I_{\ell \times \ell}\\ \mathbf{0}_{(p-\ell)\times \ell}\end{bmatrix} = U \begin{bmatrix}\Sigma_{(\ell)}\\ \mathbf{0}_{(p-\ell)\times \ell}\end{bmatrix} = U_{(\ell)} \Sigma_{(\ell)}. \end{split}\]

Put differently, the vector \(\mathbf{t}_i\) is the left singular vector \(\mathbf{u}_i\) scaled by the corresponding singular value \(\sigma_i\).
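This identity is easy to check numerically (a sketch on a toy centered matrix, using numpy.linalg.svd; the variable names are illustrative).

import numpy as np

rng = np.random.default_rng(535)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
ell = 2
T_proj = X @ Vt[:ell, :].T            # T = X V_(ell)
T_svd = U[:, :ell] * S[:ell]          # T = U_(ell) Sigma_(ell): scale the columns of U
print(np.allclose(T_proj, T_svd))     # expected: True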

NUMERICAL CORNER: Having established a formal connection between PCA and SVD, we implement PCA using our SVD algorithm (in mmids.py). We perform mean centering (now is the time to read that quote about the importance of mean centering again), but not the optional standardization. We use the fact that, in NumPy, subtracting a vector whose length matches the number of columns from a matrix subtracts it from every row (broadcasting).

def pca(X, l, maxiter=100):
    mean = np.mean(X, axis=0)  # Compute mean of each column
    Y = X - mean # Mean center each column
    U, S, V = mmids.svd(Y, l, maxiter=maxiter)
    return U[:,:l] @ np.diag(S[:l])
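
As a sanity check (not part of mmids.py), the same transformation can be computed with numpy.linalg.svd, assuming as above that NumPy is imported as np; since singular vectors are only defined up to sign, the columns of the two outputs may differ by a factor of \(-1\).

def pca_np(X, l):
    """Principal component transformation via numpy.linalg.svd (sketch)."""
    Y = X - np.mean(X, axis=0)                        # mean center each column
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :l] * S[:l]                           # scores U_(l) Sigma_(l)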

We apply it to simulated data from a mixture of two Gaussian clusters.

d, n, w = 1000, 100, 3.
X = mmids.two_mixed_clusters(d, n, w)
T = pca(X, 2)

Plotting the result, we see that PCA does succeed in finding the main direction of variation, with the first principal component aligning well with the first coordinate.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(T[:,0], T[:,1], s=10)
plt.show()

Note however that the first two principal components (especially the second one) in fact “capture more noise” than what can be seen in the first two coordinates, a form of overfitting.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], s=10)
plt.show()

TRY IT! Compute the first two right singular vectors \(\mathbf{v}_1\) and \(\mathbf{v}_2\) of \(X\) after mean centering. Do they align well with the first and second standard basis vectors \(\mathbf{e}_1\) and \(\mathbf{e}_2\)? Why or why not? (Open in Colab)
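One possible starting point (a sketch that assumes NumPy is imported as np and reuses the simulated matrix X from above, with numpy.linalg.svd standing in for the mmids routine):

Y = X - np.mean(X, axis=0)                 # mean center the simulated data
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
v1, v2 = Vt[0, :], Vt[1, :]                # first two right singular vectors
print(np.abs(v1[0]), np.abs(v2[1]))        # |<v1, e_1>| and |<v2, e_2>|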

\(\unlhd\)

3.5.2. Back to our genetic dataset#

We return to our motivating example. We apply PCA to our genetic dataset.

Figure: Viruses (Credit: Made with Midjourney)


\(\bowtie\)

We load the dataset again and examine its first rows.

df = pd.read_csv('h3n2-snp.csv')
df.head()
strain s6a s6c s6g s17a s17g s17t s39a s39c s39g ... s977g s978a s978c s978g s978t s979a s979c s979t s980a s980c
0 AB434107 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
1 AB434108 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
2 CY000113 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
3 CY000209 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
4 CY000217 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

5 rows × 318 columns

Recall that it contains \(1642\) strains; apart from the strain identifier in the first column, each strain is described by \(317\) binary features.

df.shape
(1642, 318)

Our goal is to find a “good” low-dimensional representation of the data. Ten dimensions will do here. We use PCA.

A = df.iloc[:, 1:].to_numpy()  # drop the strain identifier column
n_dims = 10
T = pca(A, n_dims)

We plot the first two principal components and see what appears to be some structure.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(T[:,0], T[:,1], s=10)
plt.show()

There seem to be some reasonably well-defined clusters in this projection. We use \(k\)-means to identify them. We take advantage of the implementation in scikit-learn, sklearn.cluster.KMeans. By default, it finds \(8\) clusters. The cluster assignments can be extracted from the attribute labels_.

from sklearn.cluster import KMeans
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10).fit(T)
assign = kmeans.labels_

To further reveal the structure, we look at how the clusters spread out over the years. That information is in a separate file.

dfoth = pd.read_csv('h3n2-other.csv')
dfoth.head()
strain length country year lon lat date
0 AB434107 1701 Japan 2002 137.215474 35.584176 2002/02/25
1 AB434108 1701 Japan 2002 137.215474 35.584176 2002/03/01
2 CY000113 1762 USA 2002 -73.940000 40.670000 2002/01/29
3 CY000209 1760 USA 2002 -73.940000 40.670000 2002/01/17
4 CY000217 1760 USA 2002 -73.940000 40.670000 2002/02/26

year = dfoth['year'].to_numpy()

For each cluster, we plot how many of its data points come from a specific year. Each cluster has a different color.

fig, ax = plt.subplots()

for i in range(n_clusters):
    unique, counts = np.unique(year[assign == i], return_counts=True)
    ax.bar(unique, counts, label=i)

ax.set(xlim=(2001, 2007), xticks=np.arange(2002, 2007))
ax.legend()
plt.show()

Remarkably, we see that each cluster comes mostly from one year or two consecutive ones. In other words, the clustering in this low-dimensional projection captures some true underlying structure that is not explicitly in the genetic data on which it is computed.

Going back to the first two principal components, we color the points on the scatterplot by year. (We use legend_elements() for automatic legend creation.)

fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
scatter = ax.scatter(T[:,0], T[:,1], s=10, c=year)
plt.legend(*scatter.legend_elements())
plt.show()

To some extent, one can “see” the virus evolving from year to year. The \(x\)-axis in particular seems to correlate strongly with the year, in the sense that samples from later years tend to be towards one side of the plot.

To further quantify this observation, we use numpy.corrcoef to compute the correlation coefficients between the year and the first \(10\) principal components.

corr = np.zeros(n_dims)
for i in range(n_dims):
    corr[i] = np.corrcoef(np.stack((T[:,i], year)))[0,1]

print(corr)
[-0.7905001  -0.42806325  0.0870437  -0.16841178 -0.05751986  0.06046913
  0.07913293 -0.01474969  0.02544753 -0.04315045]

Indeed, we see that the first two principal components correlate well with the year, while the remaining ones show only weak correlation.

Using related techniques, one can also identify which mutations distinguish different epidemics (i.e., years).

LEARNING BY CHATTING Ask your favorite AI chatbots about the difference between principal components analysis (PCA) and linear discriminant analysis (LDA).