\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

1.6. Online supplementary material#

1.6.1. Quizzes, solutions, code, etc.#

1.6.1.1. Just the code#

An interactive Jupyter notebook featuring the code in this chapter can be accessed below (Google Colab recommended). You are encouraged to tinker with it. Some suggested computational exercises are scattered throughout. The notebook is also available as a slideshow.

1.6.1.2. Self-assessment quizzes#

A more extensive web version of the self-assessment quizzes is available by following the links below.

1.6.1.3. Auto-quizzes#

Automatically generated quizzes for this chapter can be accessed here (Google Colab recommended).

1.6.1.4. Solutions to odd-numbered warm-up exercises#

(with help from Claude, Gemini, and ChatGPT)

Answer and justification for E1.2.1: The Euclidean norm \(\|\mathbf{x}\|_2\) is given by

\[ \|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2} = \sqrt{6^2 + 8^2} = \sqrt{36 + 64} = \sqrt{100} = 10. \]
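As a quick numerical check (a minimal NumPy sketch, not part of the original solution):

```python
import numpy as np

x = np.array([6.0, 8.0])
print(np.linalg.norm(x))  # Euclidean norm: 10.0
```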

Answer and justification for E1.2.3: The transpose \(A^T\) is obtained by switching rows and columns

\[\begin{split} A^T = \begin{pmatrix}1 & 4 \\ 2 & 5 \\ 3 & 6\end{pmatrix} \end{split}\]

Answer and justification for E1.2.5: A matrix \(A\) is symmetric if \(A = A^T\)

\[\begin{split} A^T = \begin{pmatrix}2 & 0 \\ 0 & 3\end{pmatrix} = A. \end{split}\]

Thus, \(A\) is symmetric.

Answer and justification for E1.2.7: \(\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h} = \lim_{h \to 0} \frac{(x+h)^2 + (x+h)y + y^2 - (x^2 + xy + y^2)}{h} = \lim_{h \to 0} \frac{2xh + h^2 + hy}{h} = 2x + y\).

\(\frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y+h) - f(x, y)}{h} = \lim_{h \to 0} \frac{x^2 + x(y+h) + (y+h)^2 - (x^2 + xy + y^2)}{h} = \lim_{h \to 0} \frac{xh + 2yh + h^2}{h} = x + 2y\).
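The two limits can be confirmed symbolically (a short SymPy sketch; the function is the one from the exercise):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + x*y + y**2
print(sp.diff(f, x))  # 2*x + y
print(sp.diff(f, y))  # x + 2*y
```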

Answer and justification E1.2.9: By Taylor’s Theorem, \(f(x) = f(a) + (x - a)f'(a) + \frac{1}{2}(x - a)^2f''(\xi)\) for some \(\xi\) between \(a\) and \(x\). Here, \(f(1) = 1^3 - 3 \cdot 1^2 + 2 \cdot 1 = 0\), \(f'(x) = 3x^2 - 6x + 2\), so \(f'(1) = 3 - 6 + 2 = -1\), and \(f''(x) = 6x - 6\). Therefore, \(f(x) = 0 + (x - 1)(-1) + \frac{1}{2}(x - 1)^2(6\xi - 6)\) for some \(\xi\) between \(1\) and \(x\).
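A symbolic check of the derivative values used above (a SymPy sketch):

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x**2 + 2*x
print(f.subs(x, 1))              # f(1) = 0
print(sp.diff(f, x).subs(x, 1))  # f'(1) = -1
print(sp.diff(f, x, 2))          # f''(x) = 6*x - 6
```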

Answer and justification for E1.2.11: First, compute \(\E[X^2]\)

\[ \E[X^2] = \sum_{x} x^2 \cdot P(X = x) = 1^2 \cdot 0.4 + 2^2 \cdot 0.6 = 0.4 + 4 \cdot 0.6 = 0.4 + 2.4 = 2.8. \]

Then, use the formula \(\mathrm{Var}[X] = \E[X^2] - (\E[X])^2\)

\[ \mathrm{Var}[X] = 2.8 - (1.6)^2 = 2.8 - 2.56 = 0.24. \]
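The arithmetic can be double-checked numerically (a NumPy sketch using the distribution stated in the exercise):

```python
import numpy as np

values = np.array([1.0, 2.0])
probs = np.array([0.4, 0.6])
mean = np.sum(values * probs)              # E[X] = 1.6
second_moment = np.sum(values**2 * probs)  # E[X^2] = 2.8
print(second_moment - mean**2)             # Var[X] = 0.24 (up to rounding)
```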

Answer and justification for E1.2.13: By Chebyshev’s inequality, \(\P[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2}\) for any \(\alpha > 0\). Here, \(\alpha = 4\), so \(\P[|X - 3| \geq 4] \leq \frac{4}{4^2} = \frac{1}{4}\).

Answer and justification for E1.2.15: The covariance matrix of \(X\) is \(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\). Since \(X_1\) and \(X_2\) are independent, their covariance is zero. The variance of each is 1 since they are standard normal.

Answer and justification for E1.2.17: \(\mathbb{E}[AX] = A\mathbb{E}[X] = A\mu_X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 5 \\ 11 \end{pmatrix}\), \(\text{Cov}[AX] = A\text{Cov}[X]A^T = A\Sigma_XA^T = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 3 & 1 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 23 & 51 \\ 51 & 115 \end{pmatrix}\).
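These matrix products are easy to verify with NumPy (a sketch using the matrices stated in the exercise):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[3.0, 1.0], [1.0, 4.0]])

print(A @ mu_X)           # E[AX] = [ 5. 11.]
print(A @ Sigma_X @ A.T)  # Cov[AX] = [[ 23.  51.] [ 51. 115.]]
```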

Answer and justification for E1.2.19: For any non-zero vector \(z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}\), we have

\[\begin{align*} z^TAz &= \begin{pmatrix} z_1 & z_2 \end{pmatrix} \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \\ &= \begin{pmatrix} z_1 & z_2 \end{pmatrix} \begin{pmatrix} 2z_1 - z_2 \\ -z_1 + 2z_2 \end{pmatrix} \\ &= 2z_1^2 - 2z_1z_2 + 2z_2^2 \\ &= z_1^2 + (z_1 - z_2)^2 + z_2^2 > 0 \end{align*}\]

since \((z_1 - z_2)^2 \geq 0\), and either \(z_1^2 > 0\) or \(z_2^2 > 0\) (since \(z \neq 0\)). Therefore, \(A\) is positive definite.
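An equivalent numerical check uses the eigenvalues of the symmetric matrix \(A\) (a NumPy sketch):

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
# A symmetric matrix is positive definite if and only if all its eigenvalues are positive.
print(np.linalg.eigvalsh(A))  # [1. 3.]
```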

Answer and justification for E1.3.1: \(\boldsymbol{\mu}_1^* = \frac{1}{2}(\mathbf{x}_1 + \mathbf{x}_4) = (\frac{3}{2}, 1)\), \(\boldsymbol{\mu}_2^* = \frac{1}{2}(\mathbf{x}_2 + \mathbf{x}_3) = (-\frac{1}{2}, 0)\).

Answer and justification for E1.3.3: \(C_1 = \{1, 5\}\), \(C_2 = \{3\}\), \(C_3 = \{2, 4\}\).

Answer and justification for E1.3.5: Expanding the squared norms and cancelling terms yields the equivalent inequality.

Answer and justification for E1.3.7: Expand \(\|A + B\|_F^2 = \sum_{i=1}^n \sum_{j=1}^m (A_{ij} + B_{ij})^2\) and simplify.

Answer and justification for E1.3.9: \(q(x) = 3(x-1)^2 + 2\), minimum value is 2 at \(x = 1\).

Answer and justification for E1.3.11: \(\|A\|_F = \sqrt{14}\).

Answer and justification for E1.4.1: Since \(X_i\) is uniformly distributed on \([-1/2, 1/2]\), its probability density function is \(f_{X_i}(x) = 1\) for \(x \in [-1/2, 1/2]\) and 0 otherwise. Therefore,

\[\mathbb{E}[X_i] = \int_{-1/2}^{1/2} x f_{X_i}(x) dx = \int_{-1/2}^{1/2} x dx = 0\]

and

\[\text{Var}[X_i] = \mathbb{E}[X_i^2] - (\mathbb{E}[X_i])^2 = \int_{-1/2}^{1/2} x^2 dx = \frac{1}{12}.\]
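A quick Monte Carlo check (a sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(-0.5, 0.5, size=1_000_000)
print(samples.mean())  # close to 0
print(samples.var())   # close to 1/12 ~ 0.0833
```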

Answer and justification for E1.4.3: We have

\[\mathbb{E}[\|X\|^2] = \mathbb{E}\left[\sum_{i=1}^d X_i^2\right] = \sum_{i=1}^d \mathbb{E}[X_i^2] = \sum_{i=1}^d 1 = d.\]
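This identity is also easy to check by simulation (a sketch; the dimension and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 100_000
X = rng.standard_normal((n, d))   # n independent standard Normal vectors in R^d
print((X**2).sum(axis=1).mean())  # close to d = 50
```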

Answer and justification for E1.4.5: By Chebyshev’s inequality, \(\P[|X - 1| \geq 3] \leq \frac{\mathrm{Var}[X]}{3^2} = \frac{4}{9}\).

Answer and justification for E1.4.7: By Chebyshev’s inequality, \(\P[|X| \geq 2\sqrt{d}] \leq \frac{\mathrm{Var}[X]}{(2\sqrt{d})^2} = \frac{1}{4d}\). For \(d = 1\), \(\P[|X| \geq 2] \leq \frac{1}{4}\). For \(d = 10\), \(\P[|X| \geq 2\sqrt{10}] \leq \frac{1}{40}\). For \(d = 100\), \(\P[|X| \geq 20] \leq \frac{1}{400}\).

1.6.1.5. Learning outcomes#

  • Compute norms of vectors and matrices, and inner products between vectors.

  • Utilize properties of norms and inner products, including the Cauchy-Schwarz inequality and the triangle inequality.

  • Perform basic matrix operations, including addition, scalar multiplication, multiplication, and transposition.

  • Compute matrix-vector and matrix-matrix products and interpret their geometric and algebraic significance.

  • Explain the concept of a descent direction and its relationship to optimization.

  • Use Taylor’s Theorem to approximate functions and analyze the error terms in these approximations.

  • Define and compute partial derivatives and gradients for functions of multiple variables.

  • State and apply Chebyshev’s inequality to quantify the concentration of a random variable around its mean.

  • Compute the expectation, variance, and covariance of random variables and vectors.

  • Define and give examples of positive semidefinite matrices.

  • Explain the properties and significance of the covariance matrix of a random vector.

  • Describe and generate samples from a spherical Gaussian distribution.

  • Write Python code to compute norms, inner products, and perform matrix operations.

  • Use Python to simulate random variables, calculate their statistical properties, and visualize distributions and the law of large numbers.

  • Formulate the k-means clustering problem as an optimization problem.

  • Describe the k-means algorithm and its iterative steps for finding cluster centers and assignments.

  • Derive the optimal cluster representatives (centroids) for a fixed partition.

  • Derive the optimal partition (cluster assignments) for fixed representatives.

  • Analyze the convergence properties of the k-means algorithm and understand its limitations in finding global optima.

  • Apply the k-means algorithm to real-world datasets, such as the penguins dataset, and interpret the results.

  • Express the k-means objective function in matrix form and relate it to matrix factorization.

  • Understand the importance of data preprocessing steps like standardization for k-means.

  • Recognize the limitations of k-means, such as its sensitivity to initialization and the need to specify the number of clusters in advance.

  • Discuss the challenges of determining the optimal number of clusters in k-means clustering.

  • Understand the challenges of clustering high-dimensional data using the k-means algorithm and recognize how the increasing dimensionality can cause the noise to overwhelm the signal, making it difficult to distinguish between clusters.

  • Comprehend the concept of concentration of measure in high-dimensional spaces, specifically the counterintuitive fact that most of the volume of a high-dimensional cube is concentrated in its corners, making it appear like a “spiky ball.”

  • Apply Chebyshev’s inequality to derive the probability that a randomly selected point from a high-dimensional cube falls within the inscribed ball, and understand how this probability decreases as the dimensionality increases.

  • Analyze the behavior of the norm of a standard Normal vector in high-dimensional spaces and prove, using Chebyshev’s inequality, that the norm is highly likely to be close to the square root of the dimension, despite the joint probability density function being maximized at the origin.

  • Use Chebyshev’s inequality to derive probabilistic bounds in high-dimensional settings.

  • Simulate high-dimensional data to empirically verify theoretical results.

\(\aleph\)

1.6.2. Additional sections#

1.6.2.1. A more rigorous analysis of clustering in high dimension#

In this optional section, we give one formal statement of the phenomenon described in the previous subsection.

THEOREM Let \(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Y}_1\) be independent spherical \(d\)-dimensional Gaussians with mean \(-w_d \mathbf{e}_1\) and variance \(1\), where \(\{w_d\}\) is a monotone sequence in \(d\). Let \(\mathbf{Y}_2\) be an independent spherical \(d\)-dimensional Gaussian with mean \(w_d \mathbf{e}_1\) and variance \(1\). Then, letting \(\Delta_d = \|\mathbf{Y}_1 - \mathbf{Y}_2\|^2 - \|\mathbf{X}_1 - \mathbf{X}_2\|^2\), as \(d \to +\infty\)

\[\begin{split} \frac{\mathbb{E}[\Delta_d]}{\sqrt{\mathrm{Var}[\Delta_d]}} \to \begin{cases} 0, & \text{if $w_d \ll d^{1/4}$}\\ +\infty, & \text{if $w_d \gg d^{1/4}$} \end{cases} \end{split}\]

where \(w_d \ll d^{1/4}\) means \(w_d/d^{1/4} \to 0\) and \(w_d \gg d^{1/4}\) means \(w_d/d^{1/4} \to +\infty\). \(\sharp\)

The ratio in the statement is referred to as the signal-to-noise ratio.

To prove the claim, we will need the following property.

LEMMA Let \(W\) be a real-valued random variable symmetric about zero, that is, such that \(W\) and \(-W\) are identically distributed. Then for all odd \(k\), \(\mathbb{E}[W^k] = 0\). \(\flat\)

Proof: By the symmetry,

\[ \mathbb{E}[W^k] = \mathbb{E}[(-W)^k] = \mathbb{E}[(-1)^k W^k] = - \mathbb{E}[W^k]. \]

The only way to satisfy this equation is to have \(\mathbb{E}[W^k] = 0\). \(\square\)

Returning to the proof of the claim:

Proof idea: (Theorem) The only coordinate contributing to \(\mathbb{E}[\Delta_d]\) is the first one by linearity of expectation, while all coordinates contribute to \(\mathrm{Var}[\Delta_d]\). More specifically, a calculation shows that the former is \(c_0 w^2\) while the latter is \(c_1 w^2 + c_2 d\), where \(c_0, c_1, c_2\) are constants.

Proof: (Theorem) Write \(w := w_d\) and \(\Delta := \Delta_d\) to simplify the notation. There are two steps:

(1) Expectation of \(\Delta\): By definition, the random variables \(X_{1,i} - X_{2,i}\), \(i = 1,\ldots, d\), and \(Y_{1,i} - Y_{2,i}\), \(i = 2,\ldots, d\), are identically distributed. So, by linearity of expectation,

\[\begin{align*} \mathbb{E}[\Delta] &= \sum_{i=1}^d \mathbb{E}[(Y_{1,i} - Y_{2,i})^2] - \sum_{i=1}^d \mathbb{E}[(X_{1,i} - X_{2,i})^2]\\ &= \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] - \mathbb{E}[(X_{1,1} - X_{2,1})^2]. \end{align*}\]

Further, we can write \(Y_{1,1} - Y_{2,1} \sim (Z_1 - w) - (Z_2 + w)\) where \(Z_1, Z_2 \sim N(0,1)\) are independent and, here, \(\sim\) indicates equality in distribution. Hence, we have

\[\begin{align*} \mathbb{E}[(Y_{1,1} - Y_{2,1})^2] &= \mathbb{E}[(Z_1 - Z_2 - 2w)^2]\\ &= \mathbb{E}[(Z_1 - Z_2)^2] - 4w \,\mathbb{E}[Z_1 - Z_2] + 4 w^2. \end{align*}\]

Similarly, \(X_{1,1} - X_{2,1} \sim Z_1 - Z_2\) so \(\mathbb{E}[(X_{1,1} - X_{2,1})^2] = \mathbb{E}[(Z_1 - Z_2)^2]\). Since \(\mathbb{E}[Z_1 - Z_2] = 0\), we finally get \(\mathbb{E}[\Delta] = 4 w^2\).

(2) Variance of \(\Delta\): Using the observations from (1) and the independence of the coordinates, we get

\[\begin{align*} \mathrm{Var}[\Delta] &= \sum_{i=1}^d \mathrm{Var}[(Y_{1,i} - Y_{2,i})^2] + \sum_{i=1}^d \mathrm{Var}[(X_{1,i} - X_{2,i})^2]\\ &= \mathrm{Var}[(Z_1 - Z_2 - 2w)^2] + (2d-1) \,\mathrm{Var}[(Z_1 - Z_2)^2]. \end{align*}\]

By the Variance of a Sum,

\[\begin{align*} \mathrm{Var}[(Z_1 - Z_2 - 2w)^2] &= \mathrm{Var}[(Z_1 - Z_2)^2 - 4w(Z_1 - Z_2) + 4w^2]\\ &= \mathrm{Var}[(Z_1 - Z_2)^2 - 4w(Z_1 - Z_2)]\\ &= \mathrm{Var}[(Z_1 - Z_2)^2] + 16 w^2 \mathrm{Var}[Z_1 - Z_2]\\ &\quad - 8w \,\mathrm{Cov}[(Z_1 - Z_2)^2, Z_1 - Z_2]. \end{align*}\]

Because \(Z_1\) and \(Z_2\) are independent, \(\mathrm{Var}[Z_1 - Z_2] = \mathrm{Var}[Z_1] + \mathrm{Var}[Z_2] = 2\). Moreover, the random variable \(Z_1 - Z_2\) is symmetric about zero, so by the lemma

\[\begin{align*} \mathrm{Cov}[(Z_1 - Z_2)^2, Z_1 - Z_2] &= \mathbb{E}[(Z_1 - Z_2)^3]\\ & \quad - \mathbb{E}[(Z_1 - Z_2)^2] \,\mathbb{E}[Z_1 - Z_2]\\ &= 0. \end{align*}\]

Finally,

\[ \mathrm{Var}[\Delta] = 32 w^2 + 2d \,\mathrm{Var}[(Z_1 - Z_2)^2]. \]

Putting everything together:

\[ \frac{\mathbb{E}[\Delta]}{\sqrt{\mathrm{Var}[\Delta]}} = \frac{4 w^2}{\sqrt{32 w^2 + 2d \,\mathrm{Var}[(Z_1 - Z_2)^2]}}. \]

Taking \(d \to +\infty\) gives the claim. \(\square\)
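The formulas derived in the proof can be checked by simulation, and the limiting behavior can be illustrated numerically (a sketch; the dimension, the separation \(w\), the sample sizes, and the exponents \(1/8\) and \(3/8\) on either side of \(1/4\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo check of E[Delta] = 4 w^2 at a fixed dimension and separation.
d, w, n = 20, 1.5, 100_000
e1 = np.zeros(d)
e1[0] = 1.0
X1 = rng.standard_normal((n, d)) - w * e1
X2 = rng.standard_normal((n, d)) - w * e1
Y1 = rng.standard_normal((n, d)) - w * e1
Y2 = rng.standard_normal((n, d)) + w * e1
Delta = ((Y1 - Y2) ** 2).sum(axis=1) - ((X1 - X2) ** 2).sum(axis=1)
print(Delta.mean(), 4 * w**2)  # empirical mean vs. 4 w^2 = 9.0

# Signal-to-noise ratio from the proof, with Var[(Z1 - Z2)^2] estimated numerically.
Z = rng.standard_normal((2, 1_000_000))
v = ((Z[0] - Z[1]) ** 2).var()

def snr(d, w):
    return 4 * w**2 / np.sqrt(32 * w**2 + 2 * d * v)

for d in [10**2, 10**4, 10**6]:
    # w_d << d^(1/4) (ratio shrinks)  vs.  w_d >> d^(1/4) (ratio grows)
    print(d, snr(d, d**0.125), snr(d, d**0.375))
```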