
1.4. Some observations about high-dimensional data#

In this section, we first apply \(k\)-means clustering to a high-dimensional example to illustrate the issues that arise in that context. We then discuss some surprising phenomena in high dimensions.

1.4.1. Clustering in high dimension#

In this section, we test our implementation of \(k\)-means on a simple simulated dataset in high dimension.

The following function generates \(n\) data points from a mixture of two equally likely, spherical \(d\)-dimensional Gaussians with variance \(1\), one with mean \(-w\mathbf{e}_1\) and one with mean \(w \mathbf{e}_1\). We use gmm2spherical from a previous section. It is found in mmids.py, which can be downloaded here.

def two_mixed_clusters(d, n, w):
    
    # mixture weights, means, and standard deviations of the two components
    phi0 = 0.5
    phi1 = 0.5
    mu0 = np.concatenate(([w], np.zeros(d-1)))
    mu1 = np.concatenate(([-w], np.zeros(d-1)))
    sig0 = 1
    sig1 = 1
    
    return mmids.gmm2spherical(
        d, n, phi0, phi1, mu0, sig0, mu1, sig1)

Two dimensions We start with \(d=2\).

d, n, w = 2, 100, 3.
X = two_mixed_clusters(d, n, w)

We use a scatterplot to visualize the data. Each dot corresponds to one data point. Observe the two clearly delineated clusters.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], marker='x')
plt.show()

Let’s run \(k\)-means on this dataset using \(k=2\). We use kmeans() from the mmids.py file.

assign = mmids.kmeans(X, 2)
1044.8267883490312
208.5284166285488
204.02397716710018
204.02397716710018
204.02397716710018
204.02397716710018
204.02397716710018
204.02397716710018
204.02397716710018
204.02397716710018

Our default of \(10\) iterations seems to have been enough for the algorithm to converge. We can visualize the result by coloring the points according to their cluster assignment.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

General dimension Let’s see what happens in higher dimension. We repeat our experiment with \(d=1000\).

d, n, w = 1000, 100, 3.
X = two_mixed_clusters(d, n, w)

Again, we observe two clearly delineated clusters.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], marker='x')
plt.show()

This dataset is in \(1000\) dimensions, but we have plotted it in only the first two coordinates. If instead we plot it in any two coordinates excluding the first, we see only one cluster.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,1], X[:,2], marker='x')
plt.show()

Let’s see how \(k\)-means fares on this dataset.

assign = mmids.kmeans(X, 2)
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592

Our attempt at clustering does not appear to have been successful.

fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()

What happened? While these clusters are easy to tease apart if one knows to look at the first coordinate only, in the full space the within-cluster and between-cluster distances become harder to distinguish: the noise overwhelms the signal.
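To quantify this heuristically (a back-of-the-envelope calculation, not part of the formal development in this section), recall that if \(\mathbf{X}\) and \(\mathbf{Y}\) are independent \(d\)-dimensional spherical Gaussians with identity covariance and means \(\boldsymbol{\mu}_X\) and \(\boldsymbol{\mu}_Y\), then

\[ \mathbb{E}\left[\|\mathbf{X} - \mathbf{Y}\|^2\right] = \|\boldsymbol{\mu}_X - \boldsymbol{\mu}_Y\|^2 + 2d. \]

So the expected squared distance is \(2d\) within a cluster and \(4w^2 + 2d\) between clusters. With \(w = 3\), that is \(4\) versus \(40\) when \(d = 2\), but \(2000\) versus \(2036\) when \(d = 1000\): the gap of \(4w^2 = 36\) coming from the means is swamped by the \(2d\) contribution of the noise.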

The function below plots histograms of within-cluster and between-cluster distances for a sample of size \(n\) in \(d\) dimensions with a given offset. As \(d\) increases, the two distributions become increasingly indistinguishable. Later in the course, we will develop dimension-reduction techniques that help deal with this issue. This time, we need labeled data points, i.e., we need to know the component from which each data point comes. For this purpose, we implement a variant of our previous mixture simulator. This one generates each Gaussian component separately and returns the two samples, so we know which component every data point belongs to.

def two_separated_clusters(d, n, w):
    
    # set parameters
    mu0 = np.concatenate(([w], np.zeros(d-1)))
    mu1 = np.concatenate(([-w], np.zeros(d-1)))
    sig0 = 1
    sig1 = 1
    
    # generate samples
    X0 = mmids.spherical_gaussian(d, n, mu0, sig0)
    X1 = mmids.spherical_gaussian(d, n, mu1, sig1)
    
    return X0, X1

def highdim_2clusters(d, n, w):
    # generate samples from the two components separately
    X0, X1 = two_separated_clusters(d, n, w)
    
    # within-cluster distances (within X0)
    intra = np.stack(
        [LA.norm(X0[i,:] - X0[j,:]) for i in range(n) for j in range(n) if j>i]
    )
    plt.hist(intra, density=True, label='within-cluster')

    # between-cluster distances (X0 versus X1)
    inter = np.stack(
        [LA.norm(X0[i,:] - X1[j,:]) for i in range(n) for j in range(n)]
    )
    plt.hist(inter, density=True, alpha=0.75, label='between-cluster')

    plt.legend(loc='upper right')
    plt.title(f'dim={d}')
    plt.show()

Next we plot the results for dimensions \(d=2, 100, 1000\). What do you observe?

highdim_2clusters(2, 100, 3)

highdim_2clusters(100, 100, 3)

highdim_2clusters(1000, 100, 3)

As the dimension increases, the distributions of within-cluster and between-cluster distances overlap significantly and become essentially indistinguishable. That provides some insight into why clustering may fail here. Note that we used the same offset \(w\) for all simulations. On the other hand, if the separation between the clusters is sufficiently large, one would expect clustering to work even in high dimension.
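Here is a rough way to see what "sufficiently large" might mean (a heuristic, not a formal claim): by the calculation above, the gap between the mean squared distances is \(4w^2\), while the standard deviation of the squared distance between two points in the same cluster grows like \(\sqrt{d}\) (it is a sum of \(d\) independent terms). So one would expect pairwise distances to separate the clusters only when

\[ 4w^2 \gg \sqrt{d}, \quad \text{that is, roughly when } w \gg d^{1/4}. \]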

TRY IT! What precedes (and what follows in the next subsection) is not a formal proof that \(k\)-means clustering will be unsuccessful here. The behavior of the algorithm is quite complex and depends, in particular, on the initialization and the density of points. Here, increasing the number of data points eventually leads to much better performance. Explore this behavior on your own by modifying the code. (For some theoretical justifications (beyond this course), see here and here.)
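One possible starting point (a minimal sketch, not the book's solution; it reuses two_separated_clusters and mmids.kmeans from above, assumes kmeans returns an array of 0/1 cluster labels as used earlier, and the helper name accuracy_vs_n is ours):

def accuracy_vs_n(d, w, n_values):
    # how often does k-means recover the true components as n grows?
    for n in n_values:
        X0, X1 = two_separated_clusters(d, n, w)
        X = np.concatenate((X0, X1))
        truth = np.concatenate((np.zeros(n), np.ones(n)))
        assign = mmids.kmeans(X, 2)
        # cluster labels are arbitrary up to swapping, so take the better of the two matchings
        acc = max(np.mean(assign == truth), np.mean(assign == 1 - truth))
        print(f"n={n}: agreement with true components = {acc:.2f}")

accuracy_vs_n(1000, 3., [100, 1000])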

1.4.2. Surprising phenomena in high dimension#

In the previous section, we saw how the contribution from a large number of “noisy dimensions” can overwhelm the “signal” in the context of clustering. In this section we discuss further properties of high-dimensional space that are relevant to data science problems.

Applying Chebyshev’s inequality to sums of independent random variables has useful statistical implications: it shows that, with a large enough number of samples \(n\), the sample mean is close to the population mean. Hence it allows us to infer properties of a population from samples. Interestingly, one can apply a similar argument to a different asymptotic regime: the limit of large dimension \(d\). But as we will see in this section, the statistical implications are quite different.

High-dimensional cube To illustrate this, we consider a simple experiment. Let \(\mathcal{C} = [-1/2,1/2]^d\) be the \(d\)-cube with side length \(1\) centered at the origin and let \(\mathcal{B} = \{\mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\|\leq 1/2\}\) be the inscribed \(d\)-ball.

Now pick a point \(\mathbf{X}\) uniformly at random in \(\mathcal{C}\). What is the probability that it falls in \(\mathcal{B}\)?

To generate \(\mathbf{X}\), we pick \(d\) independent random variables \(X_1, \ldots, X_d \sim \mathrm{U}[-1/2, 1/2]\), and form the vector \(\mathbf{X} = (X_1, \ldots, X_d)\). Indeed, the PDF of \(\mathbf{X}\) is then \(f_{\mathbf{X}}(\mathbf{x})= 1^d = 1\) if \(\mathbf{x} \in \mathcal{C}\) and \(0\) otherwise.

The event we are interested in is \(A = \left\{\|\mathbf{X}\| \leq 1/2\right\}\). The uniform distribution over the set \(\mathcal{C}\) has the property that \(\mathbb{P}[A]\) is the volume of \(\mathcal{B}\) divided by the volume of \(\mathcal{C}\). In this case, the volume of \(\mathcal{C}\) is \(1^d = 1\) and the volume of \(\mathcal{B}\) has an explicit formula.
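For completeness (the proof below does not use it), the standard formula \(\frac{\pi^{d/2}}{\Gamma(d/2+1)} r^d\) for the volume of the \(d\)-ball of radius \(r\) gives here

\[ \mathbb{P}[A] = \frac{\mathrm{vol}(\mathcal{B})}{\mathrm{vol}(\mathcal{C})} = \frac{\pi^{d/2}}{\Gamma(d/2+1)} \left(\frac{1}{2}\right)^d = \frac{\pi^{d/2}}{2^d \,\Gamma(d/2+1)}. \]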

This leads to the following surprising fact:

THEOREM (High-dimensional Cube) Let \(\mathcal{B} = \{\mathbf{x} \in \mathbb{R}^d \,:\, \|\mathbf{x}\|\leq 1/2\}\) and \(\mathcal{C} = [-1/2,1/2]^d\). Pick \(\mathbf{X} \sim \mathrm{U}[\mathcal{C}]\). Then, as \(d \to +\infty\),

\[ \mathbb{P}[\mathbf{X} \in \mathcal{B}] \to 0. \]

\(\sharp\)

In words, in high dimension if one picks a point at random from the cube, it is unlikely to be close to the origin. Instead it is likely to be in the corners. A geometric interpretation is that a high-dimensional cube is a bit like a “spiky ball.”

Figure: Visualization of a high-dimensional cube as a spiky ball (Credit: Made with Midjourney)


\(\bowtie\)

We give a proof based on Chebyshev’s inequality. It has the advantage of providing some insight into this counter-intuitive phenomenon by linking it to the concentration of sums of independent random variables, in this case the squared norm of \(\mathbf{X}\).

Proof idea: We think of \(\|\mathbf{X}\|^2\) as a sum of independent random variables and apply Chebyshev’s inequality. It implies that the norm of \(\mathbf{X}\) is concentrated around its mean, which grows like \(\sqrt{d}\). The latter is larger than \(1/2\) for \(d\) large.

Proof: To see the relevance of Chebyshev’s inequality, we compute the mean and standard deviation of the norm of \(\mathbf{X}\). In fact, because of the square root in \(\|\mathbf{X}\|\), computing its expectation is difficult. Instead we work with the squared norm

\[ \|\mathbf{X}\|^2 = X_1^2 + X_2^2 + \cdots + X_d^2, \]

which has the advantage of being a sum of independent random variables, for which the expectation is much easier to compute. Observe further that the probability of the event of interest \(\{\|\mathbf{X}\| \leq 1/2\}\) can be rewritten in terms of \(\|\mathbf{X}\|^2\) as follows:

\[\begin{align*} \mathbb{P} \left[ \|\mathbf{X}\| \leq 1/2 \right] &= \mathbb{P} \left[ \|\mathbf{X}\|^2 \leq 1/4 \right]. \end{align*}\]

To simplify the notation, we use \(\tilde\mu = \mathbb{E}[X_1^2]\) and \(\tilde\sigma = \sqrt{\mathrm{Var}[X_1^2]}\) for the mean and standard deviation of \(X_1^2\) respectively. Using linearity of expectation and the fact that the \(X_i\)’s are independent, we get

\[ \mu_{\|\mathbf{X}\|^2} = \mathbb{E}\left[ \|\mathbf{X}\|^2 \right] = \sum_{i=1}^d \mathbb{E}[X_i^2] = d \,\mathbb{E}[X_1^2] = \tilde\mu \, d, \]

and

\[ \mathrm{Var}\left[ \|\mathbf{X}\|^2 \right] = \sum_{i=1}^d \mathrm{Var}[X_i^2] = d \,\mathrm{Var}[X_1^2]. \]

Taking a square root, we get an expression for the standard deviation of our quantity of interest \(\|\mathbf{X}\|^2\) in terms of the standard deviation of \(X_1^2\)

\[ \sigma_{\|\mathbf{X}\|^2} = \tilde\sigma \, \sqrt{d}. \]

(Note that we could compute \(\tilde\mu\) and \(\tilde\sigma\) explicitly, but it will not be necessary here.)
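For the curious reader, a short side computation (which the argument does not rely on): since \(X_1 \sim \mathrm{U}[-1/2, 1/2]\),

\[ \tilde\mu = \mathbb{E}[X_1^2] = \int_{-1/2}^{1/2} x^2 \,\mathrm{d}x = \frac{1}{12}, \qquad \mathbb{E}[X_1^4] = \int_{-1/2}^{1/2} x^4 \,\mathrm{d}x = \frac{1}{80}, \]

so that \(\tilde\sigma^2 = \mathrm{Var}[X_1^2] = \frac{1}{80} - \frac{1}{144} = \frac{1}{180}\).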

We use Chebyshev’s inequality to show that \(\|\mathbf{X}\|^2\) is highly likely to be close to its mean \(\tilde\mu \, d\), which is much larger than \(1/4\) when \(d\) is large, and that therefore \(\|\mathbf{X}\|^2\) is highly unlikely to be smaller than \(1/4\). We give the details next.

By the one-sided version of Chebyshev’s inequality in terms of the standard deviation, we have

\[ \mathbb{P}\left[ \|\mathbf{X}\|^2 - \mu_{\|\mathbf{X}\|^2} \leq - \alpha \right] \leq \left(\frac{\sigma_{\|\mathbf{X}\|^2}}{\alpha}\right)^2. \]

That is, using the formulas above and rearranging slightly,

\[ \mathbb{P}\left[ \|\mathbf{X}\|^2 \leq \tilde\mu \, d - \alpha \right] \leq \left(\frac{\tilde\sigma \, \sqrt{d}}{\alpha}\right)^2. \]

How do we relate this to the probability of interest \(\mathbb{P}\left[\|\mathbf{X}\|^2 \leq 1/4\right]\)? Recall that we are free to choose \(\alpha\) in this inequality. So simply take \(\alpha\) such that

\[ \tilde\mu \,d - \alpha = \frac{1}{4}, \]

that is, \(\alpha = \tilde\mu \,d - 1/4\). Observe that, once \(d\) is large enough, it holds that \(\alpha > 0\).

Finally, replacing this choice of \(\alpha\) in the inequality above gives

\[\begin{align*} \mathbb{P} \left[ \|\mathbf{X}\| \leq 1/2 \right] &= \mathbb{P}\left[\|\mathbf{X}\|^2 \leq 1/4\right]\\ &= \mathbb{P}\left[ \|\mathbf{X}\|^2 \leq \tilde\mu \, d - \alpha \right]\\ &\leq \left(\frac{\tilde\sigma \, \sqrt{d}}{\alpha}\right)^2\\ &\leq \left(\frac{\tilde\sigma \, \sqrt{d}}{\tilde\mu \,d - 1/4}\right)^2. \end{align*}\]

Critically, \(\tilde\mu\) and \(\tilde\sigma\) do not depend on \(d\). So the right-hand side goes to \(0\) as \(d \to +\infty\). Indeed, \(d\) is much larger than \(\sqrt{d}\) when \(d\) is large. That proves the claim.\(\square\)

We will see later in the course that this high-dimensional phenomenon has implications for data science problems. It is behind what is referred to as the Curse of Dimensionality.

While Chebyshev’s inequality correctly implies that \(\mathbb{P}[\mathbf{X} \in \mathcal{B}]\) goes to \(0\), it does not give the correct rate of convergence. In reality, that probability goes to \(0\) at a much faster rate than \(1/d\). Specifically, it can be shown that \(\mathbb{P}[\mathbf{X} \in \mathcal{B}]\) goes to \(0\) roughly as \(d^{-d/2}\). We will not need or derive this fact here.

NUMERICAL CORNER: We can check the theorem in a simulation. Here we pick \(n\) points uniformly at random in the \(d\)-cube \(\mathcal{C}\), for a range of dimensions up to dmax. We then plot the frequency of landing in the inscribed \(d\)-ball \(\mathcal{B}\) and see that it rapidly converges to \(0\). Alternatively, we could just plot the formula for the volume of \(\mathcal{B}\). But knowing how to do simulations is useful in situations where explicit formulas are unavailable or intractable.

def highdim_cube(dmax, n):
    
    in_ball = np.zeros(dmax)
    for d in range(dmax):
        # recall that d starts at 0 so we add 1 below
        in_ball[d] = np.mean(
            [(LA.norm(rng.random(d+1) - 1/2) < 1/2) for _ in range(n)]
        )
    
    plt.plot(np.arange(1,dmax+1), in_ball) 
    plt.xlabel('dim')
    plt.ylabel('in-ball freq')
    plt.show()

We plot the result up to dimension \(10\).

highdim_cube(10, 1000)
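For comparison (a minimal sketch, not from the original notebook; it assumes scipy is available), the explicit formula for \(\mathbb{P}[\mathbf{X} \in \mathcal{B}]\) given earlier can be evaluated directly:

from scipy.special import gammaln

def exact_in_ball_prob(d):
    # P[X in B] = pi^{d/2} / (2^d * Gamma(d/2 + 1)), computed in log-space for stability
    return np.exp((d/2) * np.log(np.pi) - d * np.log(2) - gammaln(d/2 + 1))

print([round(float(exact_in_ball_prob(d)), 4) for d in range(1, 11)])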

\(\unlhd\)

Gaussians in high dimension We now turn our attention to the Gaussian (or Normal) distribution and its behavior in high dimension. Using Chebyshev’s inequality, we show that a standard Normal vector has the following counter-intuitive property in high dimension: the \(2\)-norm of a typical draw is highly likely to be close to \(\sqrt{d}\). Visually, when \(d\) is large, the joint PDF looks something like this:

Figure: A spherical shell (Source)


\(\bowtie\)

This is unexpected because the joint PDF is maximized at \(\mathbf{x} = \mathbf{0}\) for all \(d\) (including \(d=1\), as can be seen in the figure above). But the rough intuition is the following: (1) there is only “one way” to obtain \(\|\mathbf{X}\| = 0\) – every coordinate must be \(0\) by the point-separating property of the \(2\)-norm; (2) on the other hand, there are “many ways” to obtain \(\|\mathbf{X}\| \approx \sqrt{d}\) – and that compensates for the lower density.
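One way to make this intuition more concrete (a short side computation, not needed for the theorem below) is to note that the density of the norm \(\|\mathbf{X}\|\) at radius \(r\) is proportional to the Gaussian density on the sphere of radius \(r\) times the surface area of that sphere, which grows like \(r^{d-1}\):

\[ f_{\|\mathbf{X}\|}(r) \propto r^{d-1} e^{-r^2/2}, \qquad r > 0, \]

which is maximized at \(r = \sqrt{d-1} \approx \sqrt{d}\) for large \(d\).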

THEOREM (High-dimensional Gaussians) Let \(\mathbf{X}\) be a standard Normal \(d\)-vector. Then, for any \(\varepsilon > 0\),

\[ \mathbb{P} \left[\, \|\mathbf{X}\| \notin (\sqrt{d(1-\varepsilon)}, \sqrt{d(1+\varepsilon)}) \,\right] \to 0 \]

as \(d \to +\infty\). \(\sharp\)

Proof idea: We apply Chebyshev’s inequality to the squared norm, which is a sum of independent random variables.

Proof: Let \(Z = \|\mathbf{X}\|^2 = \sum_{i=1}^d X_i^2\) and notice that, by definition, it is a sum of independent random variables. Appealing to the expectation and variance formulas from the previous sections:

\[ \mathbb{E}[\|\mathbf{X}\|^2] = d \,\mathbb{E}[X_1^2] = d \,\mathrm{Var}[X_1] = d \]

and

\[ \mathrm{Var}[\|\mathbf{X}\|^2] = d \,\mathrm{Var}[X_1^2] \]

where \(\mathrm{Var}[X_1^2]\) does not depend on \(d\). By Chebyshev’s inequality

\[ \mathbb{P}\left[ \|\mathbf{X}\|^2 \notin \left( d(1-\varepsilon),d(1+\varepsilon) \right) \right] = \mathbb{P}[|\|\mathbf{X}\|^2 - d| \geq \varepsilon d] \leq \frac{d \,\mathrm{Var}[X_1^2]}{\varepsilon^2 d^2} = \frac{\mathrm{Var}[X_1^2]}{d \varepsilon^2}. \]

Taking a square root inside the probability on the leftmost side and taking a limit as \(d \to +\infty\) on the rightmost side gives the claim.\(\square\)
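For the record (a small side computation, not needed above): for a standard Normal, \(\mathbb{E}[X_1^4] = 3\), so \(\mathrm{Var}[X_1^2] = 3 - 1 = 2\) and the bound reads \(2/(d\varepsilon^2)\).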

NUMERICAL CORNER: We check our claim in a simulation. We generate standard Normal \(d\)-vectors using the rng.normal(0,1,d) function and plot the histogram of their \(2\)-norm.

def normal_shell(d, n):
    # norms of n independent standard Normal d-vectors
    one_sample_norm = [LA.norm(rng.normal(0,1,d)) for _ in range(n)]
    plt.hist(one_sample_norm, bins=20)
    plt.xlim(0, np.max(one_sample_norm))
    plt.show()

We first plot it in one dimension.

normal_shell(1, 10000)

In higher dimension (\(d = 100\)), where the theorem predicts that the norm concentrates around \(\sqrt{100} = 10\):

normal_shell(100, 10000)

\(\unlhd\)