
1.5. More advanced material: clustering in high dimension#

In this section, we give a more rigorous analysis of the high-dimensional phenomena discussed in the previous section.

1.5.1. A more rigorous analysis of clustering in high dimension#

In this optional section, we give one formal statement of the phenomenon described in the previous subsection.

THEOREM Let \(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Y}_1\) be independent spherical \(d\)-dimensional Gaussians with mean \(-w_d \mathbf{e}_1\) and variance \(1\), where \(\{w_d\}\) is a monotone sequence in \(d\). Let \(\mathbf{Y}_2\) be an independent spherical \(d\)-dimensional Gaussian with mean \(w_d \mathbf{e}_1\) and variance \(1\). Then, letting \(\Delta_d = \|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 - \|\mathbf{X}_1 - \mathbf{X}_2\|^2\), as \(d \to +\infty\)

\[\begin{split} \frac{\mathbb{E}[\Delta_d]}{\sqrt{\mathrm{Var}[\Delta_d]}} \to \begin{cases} 0, & \text{if $w_d \ll d^{1/4}$}\\ +\infty, & \text{if $w_d \gg d^{1/4}$} \end{cases} \end{split}\]

where \(w_d \ll d^{1/4}\) means \(w_d/d^{1/4} \to 0\). \(\sharp\)

The ratio in the statement is referred to as the signal-to-noise ratio.
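Before turning to the proof, the theorem can be checked by simulation. The sketch below (assuming NumPy is available; the dimension \(d = 400\), the sample size, and the two test values of \(w_d\) are arbitrary choices) estimates the signal-to-noise ratio empirically for one value of \(w_d\) well below \(d^{1/4}\) and one well above it.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_snr(d, w, n=5000):
    """Monte Carlo estimate of E[Delta_d] / sqrt(Var[Delta_d])."""
    mean_neg = np.zeros(d)
    mean_neg[0] = -w
    mean_pos = -mean_neg
    # X1, X2, Y1 have mean -w*e1; Y2 has mean +w*e1; identity covariance.
    X1 = rng.standard_normal((n, d)) + mean_neg
    X2 = rng.standard_normal((n, d)) + mean_neg
    Y1 = rng.standard_normal((n, d)) + mean_neg
    Y2 = rng.standard_normal((n, d)) + mean_pos
    delta = ((Y1 - Y2) ** 2).sum(axis=1) - ((X1 - X2) ** 2).sum(axis=1)
    return delta.mean() / delta.std()

d = 400  # d**0.25 is about 4.5
low = empirical_snr(d, w=1.0)    # w_d well below d**(1/4): ratio near 0
high = empirical_snr(d, w=15.0)  # w_d well above d**(1/4): ratio large
print(low, high)
```

A small ratio means the means-based gap between within-cluster and between-cluster distances is drowned out by random fluctuations; a large ratio means the two populations are easy to tell apart.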

To prove the claim, we will need the following property.

LEMMA Let \(W\) be a real-valued random variable symmetric about zero, that is, such that \(W\) and \(-W\) are identically distributed. Then for all odd \(k\), \(\mathbb{E}[W^k] = 0\). \(\flat\)

Proof: By the symmetry,

\[ \mathbb{E}[W^k] = \mathbb{E}[(-W)^k] = \mathbb{E}[(-1)^k W^k] = - \mathbb{E}[W^k]. \]

The only way to satisfy this equation is to have \(\mathbb{E}[W^k] = 0\). \(\square\)
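As a sanity check, the lemma can be verified exactly on a small discrete distribution symmetric about zero (the support \(\{\pm 1, \pm 2\}\) below is an arbitrary choice):

```python
from fractions import Fraction

# A distribution symmetric about zero: P(W = w) = P(W = -w) for every w.
pmf = {-2: Fraction(1, 4), -1: Fraction(1, 4),
        1: Fraction(1, 4),  2: Fraction(1, 4)}

def moment(k):
    """Exact k-th moment E[W^k] of the discrete distribution above."""
    return sum(p * w ** k for w, p in pmf.items())

print(moment(1), moment(3), moment(5))  # odd moments all vanish
print(moment(2))                        # even moments need not vanish
```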

Returning to the proof of the claim:

Proof idea: (Theorem) By linearity of expectation, only the first coordinate contributes to \(\mathbb{E}[\Delta_d]\), while all \(d\) coordinates contribute to \(\mathrm{Var}[\Delta_d]\). More specifically, a calculation shows that the former is \(c_0 w^2\) while the latter is \(c_1 w^2 + c_2 d\), where \(c_0, c_1, c_2\) are positive constants.

Proof: (Theorem) Write \(w := w_d\) and \(\Delta := \Delta_d\) to simplify the notation. There are two steps:

(1) Expectation of \(\Delta\): By definition, the random variables \(X_{1,i} - X_{2,i}\), \(i = 1,\ldots, d\), and \(Y_{1,i} - Y_{2,i}\), \(i = 2,\ldots, d\), are identically distributed. So, by linearity of expectation,

\[\begin{align*} \mathbb{E}[\Delta] &= \sum_{i=1}^d \mathbb{E}[(Y_{1,i} - Y_{2,i})^2] - \sum_{i=1}^d \mathbb{E}[(X_{1,i} - X_{2,i})^2]\\ &= \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] - \mathbb{E}[(X_{1,1} - X_{2,1})^2]. \end{align*}\]

Further, we can write \(Y_{1,1} - Y_{2,1} \sim (Z_1 -w) - (Z_2+w)\), where \(Z_1, Z_2 \sim N(0,1)\) are independent and \(\sim\) here indicates equality in distribution. Hence, we have

\[\begin{align*} \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] &= \mathbb{E}[(Z_1 - Z_2 - 2w)^2]\\ &= \mathbb{E}[(Z_1 - Z_2)^2] - 4w \,\mathbb{E}[Z_1 - Z_2] + 4 w^2. \end{align*}\]

Similarly, \(X_{1,1} - X_{2,1} \sim Z_1 - Z_2\), so \(\mathbb{E}[(X_{1,1} - X_{2,1})^2] = \mathbb{E}[(Z_1 - Z_2)^2]\). Since \(\mathbb{E}[Z_1 - Z_2] = 0\), we finally get \(\mathbb{E}[\Delta] = 4 w^2\).
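The identity \(\mathbb{E}[\Delta] = 4w^2\) can be spot-checked numerically via the one-coordinate reduction above (assuming NumPy; the value \(w = 2\) and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
w, n = 2.0, 400_000
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

# First coordinates: Y_{1,1} - Y_{2,1} ~ Z1 - Z2 - 2w and X_{1,1} - X_{2,1} ~ Z1 - Z2;
# the remaining coordinates cancel in expectation.
e_delta = ((z1 - z2 - 2 * w) ** 2).mean() - ((z1 - z2) ** 2).mean()
print(e_delta)  # theory predicts E[Delta] = 4 * w**2 = 16
```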

(2) Variance of \(\Delta\): Using the observations from (1) and the independence of the coordinates we get

\[\begin{align*} \mathrm{Var}[\Delta] &= \sum_{i=1}^d \mathrm{Var}[(Y_{1,i} - Y_{2,i})^2] + \sum_{i=1}^d \mathrm{Var}[(X_{1,i} - X_{2,i})^2]\\ &= \mathrm{Var}[(Z_1 - Z_2 - 2w)^2] + (2d-1) \,\mathrm{Var}[(Z_1 - Z_2)^2]. \end{align*}\]

By the Variance of a Sum,

\[\begin{align*} \mathrm{Var}[(Z_1 - Z_2 - 2w)^2] &= \mathrm{Var}[(Z_1 - Z_2)^2 - 4w(Z_1 - Z_2) + 4w^2]\\ &= \mathrm{Var}[(Z_1 - Z_2)^2 - 4w(Z_1 - Z_2)]\\ &= \mathrm{Var}[(Z_1 - Z_2)^2] + 16 w^2 \mathrm{Var}[Z_1 - Z_2]\\ &\quad - 8w \,\mathrm{Cov}[(Z_1 - Z_2)^2, Z_1 - Z_2]. \end{align*}\]

Because \(Z_1\) and \(Z_2\) are independent, \(\mathrm{Var}[Z_1 - Z_2] = \mathrm{Var}[Z_1] + \mathrm{Var}[Z_2] = 2\). Moreover, the random variable \((Z_1 - Z_2)\) is symmetric, so

\[\begin{align*} \mathrm{Cov}[(Z_1 - Z_2)^2, Z_1 - Z_2] &= \mathbb{E}[(Z_1 - Z_2)^3]\\ & \quad - \mathbb{E}[(Z_1 - Z_2)^2] \,\mathbb{E}[Z_1 - Z_2]\\ &= 0. \end{align*}\]

Finally,

\[ \mathrm{Var}[\Delta] = 32 w^2 + 2d \,\mathrm{Var}[(Z_1 - Z_2)^2]. \]
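The proof only uses that \(\mathrm{Var}[(Z_1 - Z_2)^2]\) is a constant, but its value is easy to compute. Since \(Z_1 - Z_2 \sim \sqrt{2}\, Z\) with \(Z \sim N(0,1)\), and \(\mathbb{E}[Z^4] = 3\),

\[ \mathrm{Var}[(Z_1 - Z_2)^2] = 4\,\mathrm{Var}[Z^2] = 4\left(\mathbb{E}[Z^4] - \mathbb{E}[Z^2]^2\right) = 4(3 - 1) = 8. \]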

Putting everything together:

\[ \frac{\mathbb{E}[\Delta]}{\sqrt{\mathrm{Var}[\Delta]}} = \frac{4 w^2}{\sqrt{32 w^2 + 2d \,\mathrm{Var}[(Z_1 - Z_2)^2]}}. \]

Since \(\mathrm{Var}[(Z_1 - Z_2)^2]\) is a positive constant, the denominator is of order \(\max(w, \sqrt{d})\), so the ratio is of order \(\min(w, w^2/\sqrt{d})\). When \(w_d \ll d^{1/4}\), we have \(w^2/\sqrt{d} \to 0\); when \(w_d \gg d^{1/4}\), both \(w\) and \(w^2/\sqrt{d}\) diverge. Taking \(d \to +\infty\) therefore gives the claim. \(\square\)
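With the explicit constant \(\mathrm{Var}[(Z_1 - Z_2)^2] = 8\) (a standard Gaussian fourth-moment computation), the signal-to-noise ratio has the closed form \(4w^2/\sqrt{32 w^2 + 16 d}\), and the phase transition at \(w_d \approx d^{1/4}\) can be observed directly. In the sketch below, the exponents \(0.1\) and \(0.4\) are arbitrary choices on either side of \(1/4\):

```python
import math

def snr(w, d):
    """Closed-form signal-to-noise ratio 4w^2 / sqrt(32w^2 + 16d)."""
    return 4 * w * w / math.sqrt(32 * w * w + 16 * d)

dims = [10**2, 10**4, 10**6]
below = [snr(d ** 0.1, d) for d in dims]  # w_d << d^(1/4): ratio shrinks
above = [snr(d ** 0.4, d) for d in dims]  # w_d >> d^(1/4): ratio grows
print(below)
print(above)
```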