2.6. More advanced material: orthogonality in high dimension; linear independence lemma#

In this section, we collect a few additional proofs that were skipped over in the chapter, in particular the Linear Dependence Lemma and the Dimension Theorem. We also discuss orthogonality in high dimension.

2.6.1. Orthogonality in high dimension#

In high dimension, orthogonality – or more accurately near-orthogonality – is more common than one might expect. We illustrate this phenomenon here.

Let \(\mathbf{X}\) be a standard Normal \(d\)-vector. Its joint PDF depends only on its norm \(\|\mathbf{X}\|\). So \(\mathbf{Y} = \frac{\mathbf{X}}{\|\mathbf{X}\|}\) is uniformly distributed over the \((d-1)\)-sphere \(\mathcal{S} = \{\mathbf{x}\in \mathbb{R}^d:\|\mathbf{x}\|=1\}\), that is, the surface of the unit \(d\)-ball centered around the origin. We write \(\mathbf{Y} \sim \mathrm{U}[\mathcal{S}]\). The following theorem shows that if we take two independent samples \(\mathbf{Y}_1, \mathbf{Y}_2 \sim \mathrm{U}[\mathcal{S}]\), they are likely to be nearly orthogonal when \(d\) is large, that is, \(|\langle\mathbf{Y}_1, \mathbf{Y}_2\rangle|\) is likely to be small. By symmetry, there is no loss of generality in taking one of the two vectors to be the north pole \(\mathbf{e}_d = (0,\ldots,0,1)\). A different way to state the theorem is that most of the mass of the \((d-1)\)-sphere lies in a small band around the equator.
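
To see this phenomenon concretely, here is a minimal NumPy simulation (a sketch, not part of the formal argument): it samples pairs of independent points uniformly on the sphere by normalizing standard Normal vectors, then reports the average size of their inner product as \(d\) grows.

```python
# Sample points uniformly on the (d-1)-sphere by normalizing standard Normal
# vectors, then check that independent samples are nearly orthogonal for large d.
import numpy as np

rng = np.random.default_rng(0)

def uniform_on_sphere(n, d, rng):
    """Return n samples from U[S] in R^d, one per row."""
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

for d in [3, 30, 300, 3000]:
    Y1 = uniform_on_sphere(2000, d, rng)
    Y2 = uniform_on_sphere(2000, d, rng)
    inner = np.abs(np.sum(Y1 * Y2, axis=1))   # |<Y1, Y2>| for each pair
    print(f"d = {d:5d}   average |<Y1, Y2>| = {inner.mean():.4f}")
```

The average inner product shrinks as \(d\) grows (roughly like \(1/\sqrt{d}\)), which is the near-orthogonality described above.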

Figure: Band around the equator (Source)

\(\bowtie\)

THEOREM (Orthogonality in High Dimension) Let \(\mathcal{S} = \{\mathbf{x}\in \mathbb{R}^d:\|\mathbf{x}\|=1\}\) and \(\mathbf{Y} \sim \mathrm{U}[\mathcal{S}]\). Then for any \(\varepsilon > 0\), as \(d \to +\infty\),

\[ \mathbb{P}[|\langle\mathbf{Y}, \mathbf{e}_d\rangle| \geq \varepsilon] \to 0. \]

\(\sharp\)

Proof idea: We write \(\mathbf{Y}\) in terms of a standard Normal \(d\)-vector \(\mathbf{X}\). The squared norm \(\|\mathbf{X}\|^2\) in the denominator is a sum of independent random variables. After moving it to the other side of the inequality, we are left with an event about a sum of independent random variables, to which we can apply Chebyshev's inequality.

Proof: Recall that \(\mathbf{Y}\) is \(\frac{\mathbf{X}}{\|\mathbf{X}\|}\) where \(\mathbf{X}\) is a standard Normal \(d\)-vector. The probability we want to bound can be re-written as

\[\begin{align*} \mathbb{P}[|\langle\mathbf{Y}, \mathbf{e}_d\rangle| \geq \varepsilon] &= \mathbb{P}\left[\left|\left\langle\frac{\mathbf{X}}{\|\mathbf{X}\|}, \mathbf{e}_d\right\rangle\right|^2 \geq \varepsilon^2\right]\\ &= \mathbb{P}\left[\left|\frac{\langle\mathbf{X},\mathbf{e}_d\rangle}{\|\mathbf{X}\|}\right|^2 \geq \varepsilon^2\right]\\ &= \mathbb{P}\left[\frac{X_d^2}{\sum_{j=1}^d X_j^2} \geq \varepsilon^2\right]\\ &= \mathbb{P}\left[X_d^2 \geq \varepsilon^2 \sum_{j=1}^d X_j^2\right]\\ &= \mathbb{P}\left[\sum_{j=1}^{d-1} (-\varepsilon^2 X_j^2) + (1-\varepsilon^2) X_d^2 \geq 0\right]. \end{align*}\]

We are now looking at a sum of independent (but not identically distributed) random variables

\[ Z = \sum_{j=1}^{d-1} (-\varepsilon^2 X_j^2) + (1-\varepsilon^2) X_d^2 \]

and we can appeal to our usual Chebyshev machinery. The expectation is, by linearity,

\[ \mathbb{E}[Z] = - \sum_{j=1}^{d-1} \varepsilon^2 \mathbb{E}[X_j^2] + (1-\varepsilon^2) \mathbb{E}[X_d^2] = - (d-1) \,\varepsilon^2 + (1-\varepsilon^2) \]

where we used that \(X_1,\ldots,X_d\) are standard Normal variables and that, in particular, their mean is \(0\) and their variance is \(1\) so that \(\mathbb{E}[X_1^2] = 1\).

The variance is, by independence of the \(X_j\)’s,

\[\begin{align*} \mathrm{Var}[Z] &= \sum_{j=1}^{d-1} \varepsilon^4 \mathrm{Var}[X_j^2] + (1-\varepsilon^2)^2 \mathrm{Var}[X_d^2]\\ &= \{(d-1) \,\varepsilon^4 + (1-\varepsilon^2)^2\}\mathrm{Var}[X_1^2]. \end{align*}\]
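
As a quick sanity check, these formulas can be compared to Monte Carlo estimates. The sketch below uses the standard fact that \(\mathrm{Var}[X_1^2] = 2\) for a standard Normal (a value the proof itself never needs).

```python
# Monte Carlo check of the formulas for E[Z] and Var[Z] above.
import numpy as np

rng = np.random.default_rng(2)
d, eps, n = 50, 0.3, 200000
X = rng.standard_normal((n, d))
Z = -eps**2 * np.sum(X[:, :-1]**2, axis=1) + (1 - eps**2) * X[:, -1]**2

print("E[Z]:  ", Z.mean(), "vs", -(d - 1) * eps**2 + (1 - eps**2))
print("Var[Z]:", Z.var(), "vs", ((d - 1) * eps**4 + (1 - eps**2)**2) * 2.0)  # Var[X_1^2] = 2
```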

So by Chebyshev, noting that \(\mathbb{E}[Z] < 0\) for \(d\) large enough, so that the event \(\{Z \geq 0\}\) implies \(\{|Z - \mathbb{E}[Z]| \geq |\mathbb{E}[Z]|\}\),

\[\begin{align*} \mathbb{P}\left[Z \geq 0\right] &\leq \mathbb{P}\left[\left|Z - \mathbb{E}[Z]\right|\geq |\mathbb{E}[Z]|\right]\\ &\leq \frac{\mathrm{Var}[Z]}{\mathbb{E}[Z]^2}\\ &= \frac{\{(d-1) \,\varepsilon^4 + (1-\varepsilon^2)^2\} \mathrm{Var}[X_1^2]}{\{- (d-1) \,\varepsilon^2 + (1-\varepsilon^2)\}^2}\\ &\to 0 \end{align*}\]

as \(d \to +\infty\). To get the limit, we observe that, for large \(d\), the denominator scales like \(d^2\) while the numerator scales only like \(d\). \(\square\)
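
Here is a small numerical sketch comparing the empirical probability \(\mathbb{P}[|\langle\mathbf{Y}, \mathbf{e}_d\rangle| \geq \varepsilon]\) with the Chebyshev bound derived in the proof, again using \(\mathrm{Var}[X_1^2] = 2\); both go to \(0\) as \(d\) grows, and the empirical probability is in fact much smaller than the bound.

```python
# Compare the empirical probability P[|<Y, e_d>| >= eps] with the Chebyshev
# bound from the proof (valid once E[Z] < 0, i.e. for d large enough).
import numpy as np

rng = np.random.default_rng(1)
eps, n = 0.2, 10000

for d in [50, 500, 2000]:
    X = rng.standard_normal((n, d))
    Y_last = X[:, -1] / np.linalg.norm(X, axis=1)      # <Y, e_d> = last coordinate of Y
    empirical = np.mean(np.abs(Y_last) >= eps)
    num = ((d - 1) * eps**4 + (1 - eps**2)**2) * 2.0    # Var[Z], using Var[X_1^2] = 2
    den = (-(d - 1) * eps**2 + (1 - eps**2))**2         # E[Z]^2
    print(f"d = {d:5d}   empirical = {empirical:.4f}   bound = {num / den:.4f}")
```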

2.6.2. Additional proofs#

In this subsection, we cover a few more advanced details. In particular, we prove the Linear Dependence Lemma and Dimension Theorem.

Linear Dependence Lemma and its implications

In this section, we give a proof of the Dimension Theorem.

The proof relies on a fundamental lemma. It states that we can always remove a vector from a list of linearly dependent ones without changing its span.

LEMMA (Linear Dependence) Let \(\mathbf{u}_1,\ldots,\mathbf{u}_m\) be a list of linearly dependent vectors with \(\mathbf{u}_1 \neq \mathbf{0}\). Then there is an \(i\) such that:

  1. \(\mathbf{u}_i \in \mathrm{span}(\mathbf{u}_1,\ldots,\mathbf{u}_{i-1})\)

  2. \(\mathrm{span}(\{\mathbf{u}_j:j \in [m]\}) = \mathrm{span}(\{\mathbf{u}_j:j \in [m],\ j\neq i\})\)

\(\flat\)

Proof idea (Linear Dependence): By linear dependence, \(\mathbf{0}\) can be written as a non-trivial linear combination of the \(\mathbf{u}_j\)’s. Then the index \(i\) in 1. is the largest index with non-zero coefficient.

Proof: (Linear Dependence) For 1., by linear dependence,

\[ \sum_{j=1}^m \alpha_j \mathbf{u}_j = \mathbf{0}, \]

with not all \(\alpha_j\)’s zero. Further, because \(\mathbf{u}_1 \neq \mathbf{0}\), not all \(\alpha_2, \ldots, \alpha_m\) are zero. (Why? [Hint: There are two cases to consider: \(\alpha_1 = 0\) or \(\alpha_1 \neq 0\).]) Take the largest index among the \(\alpha_j\)’s that are non-zero, say \(i\). Then rearranging the previous display and using the fact that \(\alpha_j=0\) for \(j > i\) gives

\[ \mathbf{u}_i = - \sum_{j=1}^{i-1} \frac{\alpha_j}{\alpha_i} \mathbf{u}_j \in \mathrm{span}(\mathbf{u}_1,\ldots,\mathbf{u}_{i-1}). \]

For 2., we note that for any \(\mathbf{w} \in \mathrm{span}(\{\mathbf{u}_j:j \in [m]\})\) we can write it as \(\mathbf{w} = \sum_{j=1}^m \beta_j \mathbf{u}_j\) and we can replace \(\mathbf{u}_i\) by the equation above, producing a representation of \(\mathbf{w}\) in terms of \(\{\mathbf{u}_j:j \in [m], j\neq i\}\).\(\square\)
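
The lemma can be illustrated numerically. The sketch below (an illustration only; it represents the list as columns of a NumPy array and tests span membership by least squares) finds a non-trivial combination summing to \(\mathbf{0}\), takes the largest index with a non-zero coefficient, and checks that the corresponding vector lies in the span of the previous ones.

```python
# Numerical illustration of the Linear Dependence Lemma on a dependent list.
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
u3 = np.array([2.0, -3.0, 0.0])          # u3 = 2 u1 - 3 u2, so the list is dependent
M = np.column_stack([u1, u2, u3])

# A non-trivial alpha with M @ alpha = 0, from the null space (smallest singular value).
_, s, Vt = np.linalg.svd(M)
alpha = Vt[-1]
assert np.allclose(M @ alpha, 0.0)

# Largest (0-based) index with a non-zero coefficient; this is the lemma's index i.
i = np.nonzero(np.abs(alpha) > 1e-10)[0].max()
print("i =", i)                          # here i = 2, i.e. u3

# Check that u_i lies in the span of u_1, ..., u_{i-1} (least squares residual ~ 0).
coeffs, *_ = np.linalg.lstsq(M[:, :i], M[:, i], rcond=None)
print("residual:", np.linalg.norm(M[:, :i] @ coeffs - M[:, i]))
```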

Before proving the theorem, we use the Linear Dependence Lemma to prove a key claim.

LEMMA (Independent is Shorter than Spanning) The length of any independent list in \(U\) is less than or equal to the length of any spanning list of \(U\). \(\flat\)

Proof (Independent is Shorter than Spanning): Let \(\{\mathbf{u}_j:j\in[m]\}\) be an independent list in \(U\) and let \(\{\mathbf{w}_i:i\in [m']\}\) be a spanning list of \(U\). First consider the new list \(\{\mathbf{u}_1,\mathbf{w}_1,\ldots,\mathbf{w}_{m'}\}\). Because the \(\mathbf{w}_i\)’s are spanning, adding \(\mathbf{u}_1 \neq \mathbf{0}\) to them necessarily produces a linearly dependent list. By the Linear Dependence Lemma, we can remove one of the \(\mathbf{w}_i\)’s without changing the span. The new list \(B\) has length \(m'\) again. Then we add \(\mathbf{u}_2\) to \(B\) immediately after \(\mathbf{u}_1\). By the Linear Dependence Lemma, one of the vectors in this list is in the span of the previous ones. It cannot be \(\mathbf{u}_2\) as \(\{\mathbf{u}_1, \mathbf{u}_2\}\) are linearly independent by assumption. So it must be one of the remaining \(\mathbf{w}_i\)’s. We remove that one, without changing the span by the Linear Dependence Lemma again. This process can be continued until all of the \(\mathbf{u}_j\)’s have been added. Indeed, the process cannot run out of \(\mathbf{w}_i\)’s first: otherwise, the \(\mathbf{u}_j\)’s added so far would span \(U\), and the next \(\mathbf{u}_j\) would lie in their span, contradicting linear independence. Hence, there must be at least as many \(\mathbf{w}_i\)’s as there are \(\mathbf{u}_j\)’s. \(\square\)
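
The exchange procedure in this proof can also be sketched in code. The helper below is an illustration only: it assumes vectors are represented as NumPy arrays, tests span membership by comparing matrix ranks, and is not meant to be numerically robust.

```python
# Sketch of the exchange argument: insert the independent u's one by one into
# the spanning list, each time removing a w that the Linear Dependence Lemma
# identifies as redundant.
import numpy as np

def in_span(v, vectors):
    """True if v lies in the span of the given (possibly empty) list of vectors."""
    if not vectors:
        return np.allclose(v, 0.0)
    A = np.column_stack(vectors)
    return np.linalg.matrix_rank(np.column_stack([A, v])) == np.linalg.matrix_rank(A)

def exchange(us, ws):
    """Replace one w by each u in turn, keeping the list spanning throughout."""
    B = list(ws)
    for k, u in enumerate(us):
        B.insert(k, u)                    # keep the u's at the front, in order
        # the redundant vector cannot be one of the (independent) u's,
        # so look for it among the remaining w's
        for j in range(k + 1, len(B)):
            if in_span(B[j], B[:j]):
                del B[j]
                break
        else:
            raise ValueError("us is not independent or ws does not span")
    return B

us = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
ws = [np.array([1.0, 1.0, 0.0]), np.array([1.0, -1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
print(exchange(us, ws))                   # two w's have been exchanged for u1, u2
```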

We finally return to the proof of the theorem.

Proof (Dimension): Let \(\{\mathbf{b}_j:j\in[m]\}\) and \(\{\mathbf{b}'_j:j\in[m']\}\) be two bases of \(U\). Because they both form independent and spanning lists, the Independent is Shorter than Spanning Lemma implies that each has length less than or equal to that of the other. So their lengths must be equal. This proves the claim. \(\square\)

We prove a few other useful implications of the Linear Dependence Lemma.

LEMMA (Existence of a Spanning List) There exists a spanning list of \(U\) with a finite number of elements. \(\flat\)

Proof (Existence of a Spanning List): Start with an empty list. While possible, add nonzero vectors in \(U\) that are not in the span of the previously added vectors. By the Linear Dependence Lemma, at every step the constructed list is linearly independent. By the Independent is Shorter than Spanning Lemma, this list cannot be longer than any spanning list of the ambient space \(V \supseteq U\). Because \(V\) is spanned by its (finite) standard basis, the process must stop after a finite number of steps. When it stops, every vector of \(U\) lies in the span of the constructed list, which is therefore a spanning list of \(U\). \(\square\)

LEMMA (Completing an Independent List) Any independent list in \(U\) can be completed into a basis of \(U\). \(\flat\)

Proof (Completing an Independent List): Let \(\{\mathbf{u}_j:j\in[\ell]\}\) be an independent list in \(U\). Let \(\{\mathbf{w}_i : i \in [m]\}\) be a spanning list of \(U\), which exists by the Existence of a Spanning List Lemma. Add the vectors from the spanning list one by one to the independent list if they are not in the span of the previously constructed list (or discard them otherwise). By the Linear Dependence Lemma, the list remains independent at each step. After \(m\) steps, the resulting list spans all of the \(\mathbf{w}_i\)’s. Hence it spans \(U\) and is linearly independent, that is, it is a basis of \(U\). \(\square\)
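
The greedy procedure in this proof translates directly into a short sketch. The helper below is hypothetical: it assumes vectors are NumPy arrays and tests span membership by comparing matrix ranks, as an illustration rather than a robust implementation.

```python
# Complete an independent list into a basis by greedily adding spanning vectors
# that are not in the span of the current list.
import numpy as np

def complete_to_basis(independent, spanning):
    basis = list(independent)
    for w in spanning:
        old_rank = np.linalg.matrix_rank(np.column_stack(basis))
        new_rank = np.linalg.matrix_rank(np.column_stack(basis + [w]))
        if new_rank > old_rank:           # w is not in the current span, so keep it
            basis.append(w)
    return basis

# Example in R^3: complete a single vector into a basis using the standard basis vectors.
independent = [np.array([1.0, 1.0, 0.0])]
spanning = [np.eye(3)[:, j] for j in range(3)]
print(complete_to_basis(independent, spanning))
```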