
1.5. More advanced material: clustering in high dimension#

In this section, we give a more rigorous analysis of the high-dimensional phenomena discussed in the previous section.

1.5.1. A more rigorous analysis of clustering in high dimension#

In this optional section, we give one formal statement of the phenomenon described in the previous subsection.

THEOREM Let \(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Y}_1\) be independent spherical \(d\)-dimensional Gaussians with mean \(-w_d \mathbf{e}_1\) and variance \(1\), where \(\{w_d\}\) is a monotone sequence in \(d\). Let \(\mathbf{Y}_2\) be an independent spherical \(d\)-dimensional Gaussian with mean \(w_d \mathbf{e}_1\) and variance \(1\). Then, letting \(\Delta_d = \|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 - \|\mathbf{X}_1 - \mathbf{X}_2\|^2\), as \(d \to +\infty\)

\[\begin{split} \frac{\mathbb{E}[\Delta_d]}{\sqrt{\mathrm{Var}[\Delta_d]}} \to \begin{cases} 0, & \text{if $w_d \ll d^{1/4}$}\\ +\infty, & \text{if $w_d \gg d^{1/4}$} \end{cases} \end{split}\]

where \(w_d \ll d^{1/4}\) means \(w_d/d^{1/4} \to 0\). \(\sharp\)

The ratio in the statement is referred to as the signal-to-noise ratio.
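Before turning to the proof, the theorem can be checked by simulation. The sketch below (assuming NumPy is available; the dimension \(d = 400\), the sample size, and the two test values of \(w_d\) are arbitrary choices) estimates the signal-to-noise ratio empirically for one value of \(w_d\) well below \(d^{1/4}\) and one well above it.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_snr(d, w, n=5000):
    """Monte Carlo estimate of E[Delta_d] / sqrt(Var[Delta_d])."""
    mean_neg = np.zeros(d)
    mean_neg[0] = -w
    mean_pos = -mean_neg
    # X1, X2, Y1 have mean -w*e1; Y2 has mean +w*e1; identity covariance.
    X1 = rng.standard_normal((n, d)) + mean_neg
    X2 = rng.standard_normal((n, d)) + mean_neg
    Y1 = rng.standard_normal((n, d)) + mean_neg
    Y2 = rng.standard_normal((n, d)) + mean_pos
    delta = ((Y1 - Y2) ** 2).sum(axis=1) - ((X1 - X2) ** 2).sum(axis=1)
    return delta.mean() / delta.std()

d = 400  # d**0.25 is about 4.5
low = empirical_snr(d, w=1.0)    # w_d well below d**(1/4): ratio near 0
high = empirical_snr(d, w=15.0)  # w_d well above d**(1/4): ratio large
print(low, high)
```

A small ratio means the means-based gap between within-cluster and between-cluster distances is drowned out by random fluctuations; a large ratio means the two populations are easy to tell apart.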

To prove the claim, we will need the following property.

LEMMA Let \(W\) be a real-valued random variable symmetric about zero, that is, such that \(W\) and \(-W\) are identically distributed. Then for all odd \(k\), \(\mathbb{E}[W^k] = 0\). \(\flat\)

Proof: By the symmetry,

\[ \mathbb{E}[W^k] = \mathbb{E}[(-W)^k] = \mathbb{E}[(-1)^k W^k] = - \mathbb{E}[W^k]. \]

The only way to satisfy this equation is to have \(\mathbb{E}[W^k] = 0\). \(\square\)
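As a sanity check, the lemma can be verified exactly on a small discrete distribution symmetric about zero (the support \(\{\pm 1, \pm 2\}\) below is an arbitrary choice):

```python
from fractions import Fraction

# A distribution symmetric about zero: P(W = w) = P(W = -w) for every w.
pmf = {-2: Fraction(1, 4), -1: Fraction(1, 4),
        1: Fraction(1, 4),  2: Fraction(1, 4)}

def moment(k):
    """Exact k-th moment E[W^k] of the discrete distribution above."""
    return sum(p * w ** k for w, p in pmf.items())

print(moment(1), moment(3), moment(5))  # odd moments all vanish
print(moment(2))                        # even moments need not vanish
```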

Returning to the proof of the claim:

Proof idea: (Theorem) By linearity of expectation, only the first coordinate contributes to \(\mathbb{E}[\Delta_d]\), while all \(d\) coordinates contribute to \(\mathrm{Var}[\Delta_d]\). More specifically, a calculation shows that the former is \(c_0 w^2\) while the latter is \(c_1 w^2 + c_2 d\), where \(c_0, c_1, c_2\) are positive constants.

Proof: (Theorem) Write \(w := w_d\) and \(\Delta := \Delta_d\) to simplify the notation. There are two steps:

(1) Expectation of \(\Delta\): By definition, the random variables \(X_{1,i} - X_{2,i}\), \(i = 1,\ldots, d\), and \(Y_{1,i} - Y_{2,i}\), \(i = 2,\ldots, d\), are identically distributed. So, by linearity of expectation,

\[\begin{align*} \mathbb{E}[\Delta] &= \sum_{i=1}^d \mathbb{E}[(Y_{1,i} - Y_{2,i})^2] - \sum_{i=1}^d \mathbb{E}[(X_{1,i} - X_{2,i})^2]\\ &= \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] - \mathbb{E}[(X_{1,1} - X_{2,1})^2]. \end{align*}\]

Further, we can write \(Y_{1,1} - Y_{2,1} \sim (Z_1 -w) - (Z_2+w)\), where \(Z_1, Z_2 \sim N(0,1)\) are independent and \(\sim\) here indicates equality in distribution. Hence, we have

\[\begin{align*} \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] &= \mathbb{E}[(Z_1 - Z_2 - 2w)^2]\\ &= \mathbb{E}[(Z_1 - Z_2)^2] - 4w \,\mathbb{E}[Z_1 - Z_2] + 4 w^2. \end{align*}\]

Similarly, \(X_{1,1} - X_{2,1} \sim Z_1 - Z_2\), so \(\mathbb{E}[(X_{1,1} - X_{2,1})^2] = \mathbb{E}[(Z_1 - Z_2)^2]\). Since \(\mathbb{E}[Z_1 - Z_2] = 0\), we finally get \(\mathbb{E}[\Delta] = 4 w^2\).
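The identity \(\mathbb{E}[\Delta] = 4w^2\) can be spot-checked numerically via the one-coordinate reduction above (assuming NumPy; the value \(w = 2\) and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
w, n = 2.0, 400_000
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

# First coordinates: Y_{1,1} - Y_{2,1} ~ Z1 - Z2 - 2w and X_{1,1} - X_{2,1} ~ Z1 - Z2;
# the remaining coordinates cancel in expectation.
e_delta = ((z1 - z2 - 2 * w) ** 2).mean() - ((z1 - z2) ** 2).mean()
print(e_delta)  # theory predicts E[Delta] = 4 * w**2 = 16
```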

(2) Variance of \(\Delta\): Using the observations from (1) and the independence of the coordinates we get

\[\begin{align*} \mathrm{Var}[\Delta] &= \sum_{i=1}^d \mathrm{Var}[(Y_{1,i} - Y_{2,i})^2] + \sum_{i=1}^d \mathrm{Var}[(X_{1,i} - X_{2,i})^2]\\ &= \mathrm{Var}[(Z_1 - Z_2 - 2w)^2] + (2d-1) \,\mathrm{Var}[(Z_1 - Z_2)^2]. \end{align*}\]

By the Variance of a Sum,

\[\begin{align*} \mathrm{Var}[(Z_1 - Z_2 - 2w)^2] &= \mathrm{Var}[(Z_1 - Z_2)^2 - 4w(Z_1 - Z_2) + 4w^2]\\ &= \mathrm{Var}[(Z_1 - Z_2)^2 - 4w(Z_1 - Z_2)]\\ &= \mathrm{Var}[(Z_1 - Z_2)^2] + 16 w^2 \mathrm{Var}[Z_1 - Z_2]\\ &\quad - 8w \,\mathrm{Cov}[(Z_1 - Z_2)^2, Z_1 - Z_2]. \end{align*}\]

Because \(Z_1\) and \(Z_2\) are independent, \(\mathrm{Var}[Z_1 - Z_2] = \mathrm{Var}[Z_1] + \mathrm{Var}[Z_2] = 2\). Moreover, the random variable \((Z_1 - Z_2)\) is symmetric, so

\[\begin{align*} \mathrm{Cov}[(Z_1 - Z_2)^2, Z_1 - Z_2] &= \mathbb{E}[(Z_1 - Z_2)^3]\\ & \quad - \mathbb{E}[(Z_1 - Z_2)^2] \,\mathbb{E}[Z_1 - Z_2]\\ &= 0. \end{align*}\]

Finally,

\[ \mathrm{Var}[\Delta] = 32 w^2 + 2d \,\mathrm{Var}[(Z_1 - Z_2)^2]. \]
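The proof only uses that \(\mathrm{Var}[(Z_1 - Z_2)^2]\) is a constant, but its value is easy to compute. Since \(Z_1 - Z_2 \sim \sqrt{2}\, Z\) with \(Z \sim N(0,1)\), and \(\mathbb{E}[Z^4] = 3\),

\[ \mathrm{Var}[(Z_1 - Z_2)^2] = 4\,\mathrm{Var}[Z^2] = 4\left(\mathbb{E}[Z^4] - \mathbb{E}[Z^2]^2\right) = 4(3 - 1) = 8. \]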

Putting everything together:

\[ \frac{\mathbb{E}[\Delta]}{\sqrt{\mathrm{Var}[\Delta]}} = \frac{4 w^2}{\sqrt{32 w^2 + 2d \,\mathrm{Var}[(Z_1 - Z_2)^2]}}. \]

Since \(\mathrm{Var}[(Z_1 - Z_2)^2]\) is a positive constant, the denominator is of order \(\max(w, \sqrt{d})\), so the ratio is of order \(\min(w, w^2/\sqrt{d})\). When \(w_d \ll d^{1/4}\), we have \(w^2/\sqrt{d} \to 0\); when \(w_d \gg d^{1/4}\), both \(w\) and \(w^2/\sqrt{d}\) diverge. Taking \(d \to +\infty\) therefore gives the claim. \(\square\)
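With the explicit constant \(\mathrm{Var}[(Z_1 - Z_2)^2] = 8\) (a standard Gaussian fourth-moment computation), the signal-to-noise ratio has the closed form \(4w^2/\sqrt{32 w^2 + 16 d}\), and the phase transition at \(w_d \approx d^{1/4}\) can be observed directly. In the sketch below, the exponents \(0.1\) and \(0.4\) are arbitrary choices on either side of \(1/4\):

```python
import math

def snr(w, d):
    """Closed-form signal-to-noise ratio 4w^2 / sqrt(32w^2 + 16d)."""
    return 4 * w * w / math.sqrt(32 * w * w + 16 * d)

dims = [10**2, 10**4, 10**6]
below = [snr(d ** 0.1, d) for d in dims]  # w_d << d^(1/4): ratio shrinks
above = [snr(d ** 0.4, d) for d in dims]  # w_d >> d^(1/4): ratio grows
print(below)
print(above)
```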