\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

1.5. Exercises#

1.5.1. Warm-up worksheets#

(with help from Claude, Gemini, and ChatGPT)

Section 1.2

E1.2.1 Calculate the Euclidean norm of the vector \(\mathbf{x} = (6, 8)\).

E1.2.2 Compute the inner product \(\langle \mathbf{u}, \mathbf{v} \rangle\) for vectors \(\mathbf{u} = (1, 2, 3)\) and \(\mathbf{v} = (4, 5, 6)\).

E1.2.3 Find the transpose of the matrix \(A = \begin{pmatrix}1 & 2 & 3 \\ 4 & 5 & 6\end{pmatrix}\).

E1.2.4 Compute the Frobenius norm of the matrix \(A = \begin{pmatrix}1 & 2 \\ 3 & 4\end{pmatrix}\).

E1.2.5 Verify if the matrix \(A = \begin{pmatrix}2 & 0 \\ 0 & 3\end{pmatrix}\) is symmetric.

E1.2.6 Let \(A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\) and \(B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}\). Compute \(AB\) and \(BA\).
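
Several of the warm-ups above (E1.2.1 through E1.2.6) can be double-checked numerically. Here is a minimal NumPy sketch, not part of the exercises, that mirrors them; the variable names are ours.

```python
import numpy as np

# E1.2.1-E1.2.2: Euclidean norm and inner product
x = np.array([6.0, 8.0])
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(np.linalg.norm(x))          # Euclidean norm of x
print(np.inner(u, v))             # inner product <u, v>

# E1.2.3: transpose of a 2x3 matrix
A23 = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(A23.T)

# E1.2.4-E1.2.5: Frobenius norm and a symmetry check
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))   # Frobenius norm of A
D = np.array([[2.0, 0.0], [0.0, 3.0]])
print(np.array_equal(D, D.T))     # True if and only if D is symmetric

# E1.2.6: matrix products AB and BA need not agree
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(A @ B)
print(B @ A)
```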

E1.2.7 Let \(f(x, y) = x^2 + xy + y^2\). Compute the partial derivatives \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\).

E1.2.8 Given the function \(g(x, y) = x^2 y + xy^3\), compute its partial derivatives \(\frac{\partial g}{\partial x}\) and \(\frac{\partial g}{\partial y}\).

E1.2.9 Let \(f(x) = x^3 - 3x^2 + 2x\). Use Taylor’s Theorem to find a linear approximation of \(f\) around \(a = 1\) with a second-order error term.

E1.2.10 Compute the expectation \(\E[X]\) of a discrete random variable \(X\) with \(\P(X = 1) = 0.2\), \(\P(X = 2) = 0.5\), and \(\P(X = 3) = 0.3\).

E1.2.11 Calculate the variance \(\mathrm{Var}[X]\) of a discrete random variable \(X\) with \(\P(X = 1) = 0.4\), \(\P(X = 2) = 0.6\), and \(\E[X] = 1.6\).

E1.2.12 Let \(X\) and \(Y\) be random variables with \(\mathbb{E}[X] = 2\), \(\mathbb{E}[Y] = 3\), \(\mathrm{Var}[X] = 4\), and \(\mathrm{Var}[Y] = 9\). Compute \(\mathbb{E}[2X + Y - 1]\) and \(\mathrm{Var}[2X + Y - 1]\), assuming \(X\) and \(Y\) are independent.
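
For E1.2.10 through E1.2.12, a quick numerical check is to form the weighted sums directly. The sketch below is one way to do it; the helper name `discrete_moments` is ours.

```python
import numpy as np

def discrete_moments(vals, probs):
    """Mean and variance of a discrete random variable from its pmf."""
    vals = np.asarray(vals, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = np.sum(vals * probs)
    var = np.sum((vals - mean) ** 2 * probs)
    return mean, var

print(discrete_moments([1, 2, 3], [0.2, 0.5, 0.3]))   # E1.2.10
print(discrete_moments([1, 2], [0.4, 0.6]))           # E1.2.11

# E1.2.12: recall that for independent X and Y,
# E[aX + bY + c] = a E[X] + b E[Y] + c and Var[aX + bY + c] = a^2 Var[X] + b^2 Var[Y].
```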

E1.2.13 Let \(X\) be a random variable with \(\mathbb{E}[X] = 3\) and \(\mathrm{Var}[X] = 4\). Use Chebyshev’s Inequality to bound \(\P[|X - 3| \geq 4]\).

E1.2.14 A random variable \(X\) has mean \(\mu = 5\) and variance \(\sigma^2 = 4\). Use Chebyshev’s Inequality to find an upper bound on the probability that \(X\) is more than 3 units away from its mean.

E1.2.15 Given the random vector \(X = (X_1, X_2)\) where \(X_1\) and \(X_2\) are independent standard normal random variables, what is the covariance matrix of \(X\)?

E1.2.16 Let \(X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\) be a random vector with \(\mathbb{E}[X_1] = 1\), \(\mathbb{E}[X_2] = 2\), \(\text{Var}[X_1] = 3\), \(\text{Var}[X_2] = 4\), and \(\text{Cov}[X_1, X_2] = 1\). Write the mean vector \(\mu_X\) and the covariance matrix \(\Sigma_X\) of \(X\).

E1.2.17 Let \(X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\) be a random vector with mean \(\mu_X = \begin{pmatrix} 1 \\ 2 \end{pmatrix}\) and covariance matrix \(\Sigma_X = \begin{pmatrix} 3 & 1 \\ 1 & 4 \end{pmatrix}\). Let \(A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\). Compute \(\mathbb{E}[AX]\) and \(\text{Cov}[AX]\).

E1.2.18 Let \(X_1, X_2, X_3\) be independent random variables with \(\mathbb{E}[X_1] = 1\), \(\mathbb{E}[X_2] = 2\), \(\mathbb{E}[X_3] = 3\), \(\text{Var}[X_1] = 4\), \(\text{Var}[X_2] = 5\), and \(\text{Var}[X_3] = 6\). Compute \(\mathbb{E}[X_1 + X_2 + X_3]\) and \(\text{Var}[X_1 + X_2 + X_3]\).

E1.2.19 Let \(A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}\). Show that \(A\) is positive definite.
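
The affine rules used in E1.2.17, namely \(\mathbb{E}[AX] = A \mu_X\) and \(\mathrm{Cov}[AX] = A \Sigma_X A^T\), and the eigenvalue criterion relevant to E1.2.19 (a symmetric matrix is positive definite if and only if all of its eigenvalues are strictly positive) can both be checked with NumPy. A possible sketch:

```python
import numpy as np

# E1.2.17: mean vector and covariance matrix of AX
mu_X = np.array([1.0, 2.0])
Sigma_X = np.array([[3.0, 1.0], [1.0, 4.0]])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(A @ mu_X)             # E[AX] = A E[X]
print(A @ Sigma_X @ A.T)    # Cov[AX] = A Sigma_X A^T

# E1.2.19: numerical cross-check of positive definiteness
M = np.array([[2.0, -1.0], [-1.0, 2.0]])
eigs = np.linalg.eigvalsh(M)          # eigenvalues of a symmetric matrix
print(eigs, np.all(eigs > 0))         # all positive <=> positive definite
```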

Section 1.3

E1.3.1 Given the data points \(\mathbf{x}_1 = (1, 2)\), \(\mathbf{x}_2 = (-1, 1)\), \(\mathbf{x}_3 = (0, -1)\), and \(\mathbf{x}_4 = (2, 0)\), and the partition \(C_1 = \{1, 4\}\), \(C_2 = \{2, 3\}\), compute the optimal cluster representatives \(\boldsymbol{\mu}_1^*\) and \(\boldsymbol{\mu}_2^*\).

E1.3.2 For the data points and partition from E1.3.1, compute the \(k\)-means objective value \(\mathcal{G}(C_1, C_2)\).

E1.3.3 Given the cluster representatives \(\boldsymbol{\mu}_1 = (0, 1)\), \(\boldsymbol{\mu}_2 = (1, -1)\), and \(\boldsymbol{\mu}_3 = (-2, 0)\), and the data points \(\mathbf{x}_1 = (1, 1)\), \(\mathbf{x}_2 = (-1, -1)\), \(\mathbf{x}_3 = (2, -2)\), \(\mathbf{x}_4 = (-2, 1)\), and \(\mathbf{x}_5 = (0, 0)\), find the optimal partition \(C_1, C_2, C_3\).

E1.3.4 For the data points and cluster representatives from E1.3.3, compute the \(k\)-means objective value \(G(C_1, C_2, C_3; \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\mu}_3)\).
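
For E1.3.1 through E1.3.4, the centroids, nearest-representative assignments, and objective values can be cross-checked numerically once you have worked them out by hand. A minimal NumPy sketch (array names are ours):

```python
import numpy as np

# E1.3.1-E1.3.2: optimal representatives (centroids) and objective value
X = np.array([[1.0, 2.0], [-1.0, 1.0], [0.0, -1.0], [2.0, 0.0]])
C1, C2 = [0, 3], [1, 2]        # zero-based indices for C_1 = {1, 4}, C_2 = {2, 3}
mu1, mu2 = X[C1].mean(axis=0), X[C2].mean(axis=0)
print(mu1, mu2)
print(np.sum((X[C1] - mu1) ** 2) + np.sum((X[C2] - mu2) ** 2))

# E1.3.3-E1.3.4: assign each point to its nearest representative,
# then add up the squared distances
Y = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, -2.0], [-2.0, 1.0], [0.0, 0.0]])
mus = np.array([[0.0, 1.0], [1.0, -1.0], [-2.0, 0.0]])
dists = np.sum((Y[:, None, :] - mus[None, :, :]) ** 2, axis=-1)
print(np.argmin(dists, axis=1))         # cluster index (0, 1, or 2) of each point
print(np.sum(np.min(dists, axis=1)))    # objective value
```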

E1.3.5 Show that for any two cluster representatives \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2 \in \mathbb{R}^d\) and any data point \(\mathbf{x} \in \mathbb{R}^d\), the inequality \(\|\mathbf{x} - \boldsymbol{\mu}_1\|^2 \leq \|\mathbf{x} - \boldsymbol{\mu}_2\|^2\) is equivalent to \(2(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \mathbf{x} \leq \|\boldsymbol{\mu}_2\|^2 - \|\boldsymbol{\mu}_1\|^2\).

E1.3.6 Given the cluster assignment matrix

\[\begin{split} Z = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0 \end{bmatrix} \end{split}\]

and the cluster representative matrix

\[\begin{split} U = \begin{bmatrix} 1 & 2\\ -1 & 0\\ 0 & -2 \end{bmatrix}, \end{split}\]

compute the matrix product \(ZU\).

E1.3.7 For any two matrices \(A, B \in \mathbb{R}^{n \times m}\), prove that \(\|A + B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 + 2 \sum_{i=1}^n \sum_{j=1}^m A_{ij} B_{ij}\).

E1.3.8 Show that for any matrix \(A \in \mathbb{R}^{n \times m}\) and any scalar \(c \in \mathbb{R}\), it holds that \(\|cA\|_F = |c| \|A\|_F\).

E1.3.9 Complete the square for the quadratic function \(q(x) = 3x^2 - 6x + 5\). What is the minimum value of \(q\) and where is it attained?

E1.3.10 Let \(f_1(x) = x^2 + 1\) and \(f_2(y) = 2y^2 - 3\). Define \(h(x, y) = f_1(x) + f_2(y)\). Find the global minimum of \(h(x, y)\).

E1.3.11 Compute the Frobenius norm of the matrix

\[\begin{split} A = \begin{bmatrix} 2 & -1 \\ 0 & 3 \end{bmatrix}. \end{split}\]

Section 1.4

E1.4.1 Let \(X_1, \dots, X_d\) be independent random variables uniformly distributed on \([-1/2, 1/2]\). Compute \(\mathbb{E}[X_i]\) and \(\text{Var}[X_i]\).

E1.4.2 Let \(X = (X_1, \dots, X_d)\) be a \(d\)-dimensional vector where each \(X_i\) is an independent random variable uniformly distributed on \([-1/2, 1/2]\). Compute \(\mathbb{E}[\|X\|^2]\).

E1.4.3 Let \(X = (X_1, \dots, X_d)\) be a \(d\)-dimensional vector where each \(X_i\) is an independent standard normal random variable. Compute \(\mathbb{E}[\|X\|^2]\).

E1.4.4 Let \(X_1, X_2, \ldots, X_d\) be independent random variables with \(\E[X_i] = \mu\) and \(\mathrm{Var}[X_i] = \sigma^2\) for all \(i\). Find \(\E[\sum_{i=1}^d X_i]\) and \(\mathrm{Var}[\sum_{i=1}^d X_i]\).

E1.4.5 Given a random variable \(X\) with \(\E[X] = 1\) and \(\mathrm{Var}[X] = 4\), bound \(\P[|X - 1| \geq 3]\) using Chebyshev’s Inequality.

E1.4.6 Let \(X = (X_1, \ldots, X_d)\) be a random vector with independent components, each following \(U[-\frac{1}{2}, \frac{1}{2}]\). Compute \(\P[\|X\|_2 \leq \frac{1}{2}]\) explicitly for \(d = 1, 2, 3\). [Hint: Use a geometric argument.]

E1.4.7 Given a random variable \(X\) with \(\E[X] = 0\) and \(\mathrm{Var}[X] = 1\), bound \(\P[|X| \geq 2\sqrt{d}]\) using Chebyshev’s Inequality for \(d = 1, 10, 100\).

E1.4.8 Let \(X = (X_1, \ldots, X_d)\) be a random vector with independent components, each following \(U[-\frac{1}{2}, \frac{1}{2}]\). Find an upper bound for \(\P[\|X\|_2 \geq \frac{1}{2}]\) using Chebyshev’s Inequality for \(d = 1, 10, 100\).
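
The expectations and probabilities in E1.4.2, E1.4.3, E1.4.6, and E1.4.8 can be probed by simulation before (or after) computing them exactly. A Monte Carlo sketch along the following lines is one option; the sample size and the values of \(d\) are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000                                     # Monte Carlo sample size

for d in [1, 2, 3, 10, 100]:
    U = rng.uniform(-0.5, 0.5, size=(n, d))    # uniform on the cube [-1/2, 1/2]^d
    G = rng.standard_normal(size=(n, d))       # standard Gaussian in d dimensions
    sq_U = np.sum(U ** 2, axis=1)
    sq_G = np.sum(G ** 2, axis=1)
    print(d,
          sq_U.mean(),                         # compare to E[||X||^2], cube case
          sq_G.mean(),                         # compare to E[||X||^2], Gaussian case
          np.mean(np.sqrt(sq_U) <= 0.5))       # compare to P[||X||_2 <= 1/2]
```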

1.5.2. Problems#

1.1 (Adapted from [VLMS]) Recall that for two vectors \(\mathbf{x}\) and \(\mathbf{y}\) of the same dimension, their inner product is \(\mathbf{x}^T \mathbf{y}\). A vector is said to be nonnegative if all of its entries are \(\geq 0\).

a) Show that the inner product of two nonnegative vectors is necessarily \(\geq 0\).

b) Suppose the inner product of two nonnegative vectors is zero. What can you say about their sparsity patterns, i.e., which of their entries are zero or nonzero? \(\lhd\)

1.2 (Adapted from [VLMS]) A Boolean \(n\)-vector is one for which all entries are either \(0\) or \(1\). Such vectors are used to encode whether each of \(n\) conditions holds, with \(a_i = 1\) meaning that condition \(i\) holds.

a) Another common encoding of the same information uses the two values \(-1\) and \(1\) for the entries. For example the Boolean vector \((0,1,1,0)\) would be written using this alternative encoding as \((-1, 1, 1, -1)\). Suppose that \(\mathbf{x}\) is a Boolean vector, and \(\mathbf{y}\) is a vector encoding the same information using the values \(-1\) and \(1\). Express \(\mathbf{y}\) in terms of \(\mathbf{x}\) using vector notation. Also, express \(\mathbf{x}\) in terms of \(\mathbf{y}\) using vector notation.

b) Suppose that \(\mathbf{x}\) and \(\mathbf{y}\) are Boolean \(n\)-vectors. Give a simple word description of their squared Euclidean distance \(\|\mathbf{x} - \mathbf{y}\|_2^2\) and justify it mathematically. \(\lhd\)

1.3 (Adapted from [VLMS]) If \(\mathbf{a}\) is a vector, then \(\mathbf{a}_{r:s}\) is the vector of size \(s - r + 1\), with entries \(a_r,\ldots,a_s\), i.e., \(\mathbf{a}_{r:s} = (a_r,\ldots,a_s)\). The vector \(\mathbf{a}_{r:s}\) is called a slice. As a more concrete example, if \(\mathbf{z}\) is the \(4\)-vector \((1, -1, 2, 0)\), the slice \(\mathbf{z}_{2:3} = (-1, 2)\). Suppose the \(T\)-vector \(\mathbf{x}\) represents a time series or signal. The quantity

\[ \mathcal{D}(\mathbf{x})=(x_1 - x_2)^2 + (x_2 - x_3)^2 + \cdots+(x_{T-1} - x_T)^2, \]

the sum of squared differences of adjacent values of the signal, is called the Dirichlet energy of the signal. The Dirichlet energy is a measure of the roughness or wiggliness of the time series.

a) Express \(\mathcal{D}(\mathbf{x})\) in vector notation using slicing. [Hint: Note the similarity between \(\mathcal{D}(\mathbf{x})\) and the squared Euclidean distance between two vectors.]

b) How small can \(\mathcal{D}(\mathbf{x})\) be? What signals \(\mathbf{x}\) have this minimum value of the Dirichlet energy?

c) Find a signal \(\mathbf{x}\) with entries no more than one in absolute value that has the largest possible value of \(\mathcal{D}(\mathbf{x})\). Give the value of the Dirichlet energy achieved. \(\lhd\)
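
Part (a) has a direct NumPy analogue: slicing lets you compute \(\mathcal{D}(\mathbf{x})\) without writing a loop. A possible sketch (keep in mind that NumPy slices are zero-based, unlike the one-based slices defined above):

```python
import numpy as np

def dirichlet_energy(x):
    """Sum of squared differences of adjacent entries, computed via slicing."""
    x = np.asarray(x, dtype=float)
    return np.sum((x[:-1] - x[1:]) ** 2)

print(dirichlet_energy([1.0, -1.0, 2.0, 0.0]))   # the 4-vector z from the slicing example above
```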

1.4 (Adapted from [VLMS]) A vector of length \(n\) can represent the number of times each word in a dictionary of \(n\) words appears in a document. For example, \((25,2,0)\) means that the first dictionary word appears \(25\) times, the second one twice, and the third one not at all. Suppose the \(n\)-vector \(\mathbf{w}\) is the word count vector associated with a document and a dictionary of \(n\) words. For simplicity we will assume that all words in the document appear in the dictionary.

a) What is \(\mathbf{1}^T \mathbf{w}\)? Here \(\mathbf{1}\) is an all-one vector of the appropriate size.

b) What does \(w_{282} = 0\) mean?

c) Let \(\mathbf{h}\) be the \(n\)-vector that gives the histogram of the word counts, i.e., \(h_i\) is the fraction of the words in the document that are word \(i\). Use vector notation to express \(\mathbf{h}\) in terms of \(\mathbf{w}\). (You can assume that the document contains at least one word.) \(\lhd\)

1.5 (Adapted from [VLMS]) Suppose \(\mathbf{a}\) and \(\mathbf{b}\) are vectors of the same size. The triangle inequality states that \(\|\mathbf{a} + \mathbf{b}\|_2 \leq \|\mathbf{a}\|_2 + \|\mathbf{b}\|_2\). Show that we also have \(\|\mathbf{a} + \mathbf{b}\|_2 \geq \|\mathbf{a}\|_2 - \|\mathbf{b}\|_2\). \(\lhd\)

1.6 (Adapted from [VLMS]) Verify that the following identities hold for any two vectors \(\mathbf{a}\) and \(\mathbf{b}\) of the same size.

a) \((\mathbf{a}+\mathbf{b})^T(\mathbf{a} - \mathbf{b})=\|\mathbf{a}\|_2^2 - \|\mathbf{b}\|_2^2\).

b) \(\|\mathbf{a} + \mathbf{b}\|_2^2 + \|\mathbf{a} - \mathbf{b}\|_2^2 = 2(\|\mathbf{a}\|_2^2 + \|\mathbf{b}\|_2^2)\). This is called the Parallelogram Law. \(\lhd\)

1.7 (Adapted from [VLMS]) The root-mean-square (RMS) value of an \(n\)-vector \(\mathbf{x}\) is defined as

\[ \mathrm{rms}(\mathbf{x}) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|\mathbf{x}\|_2}{\sqrt{n}}. \]

a) Show that at least one entry of a vector has absolute value at least as large as the RMS value of the vector.

b) For an \(n\)-vector \(\mathbf{x}\), let \(\mathrm{avg}(\mathbf{x}) = \mathbf{1}^T \mathbf{x}/n\) and

\[ \mathrm{std}(\mathbf{x}) = \frac{\|\mathbf{x} - \mathrm{avg}(\mathbf{x}) \mathbf{1}\|_2}{\sqrt{n}}. \]

Establish the identity

\[ \mathrm{rms}(\mathbf{x})^2 = \mathrm{avg}(\mathbf{x})^2 + \mathrm{std}(\mathbf{x})^2. \]

c) Use (b) to show that \(|\mathrm{avg}(\mathbf{x})| \leq \mathrm{rms}(\mathbf{x})\).

d) Use (b) to show that \(\mathrm{std}(\mathbf{x}) \leq \mathrm{rms}(\mathbf{x})\). \(\lhd\)
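
Before proving the identity in part (b), you may want to confirm it numerically on a random vector; here is one possible check.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)
n = x.size

rms = np.linalg.norm(x) / np.sqrt(n)
avg = np.sum(x) / n                              # 1^T x / n
std = np.linalg.norm(x - avg * np.ones(n)) / np.sqrt(n)

print(rms ** 2, avg ** 2 + std ** 2)             # the two sides should agree
```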

1.8 (Adapted from [VLMS]) Suppose that the vectors \(\mathbf{x}_1,\ldots,\mathbf{x}_N\) are clustered using k-means, with group representatives \(\mathbf{z}_1,\ldots,\mathbf{z}_k\).

a) Suppose the original vectors \(\mathbf{x}_i\) are nonnegative, i.e., their entries are nonnegative. Explain why the representatives \(\mathbf{z}_j\) are also nonnegative.

b) Suppose the original vectors \(\mathbf{x}_i\) represent proportions, i.e., their entries are nonnegative and sum to one. Explain why the representatives \(\mathbf{z}_j\) also represent proportions, i.e., their entries are nonnegative and sum to one.

c) Suppose the original vectors \(\mathbf{x}_i\) are Boolean, i.e., their entries are either \(0\) or \(1\). Give an interpretation of \((\mathbf{z}_j)_i\), the \(i\)-th entry of the \(j\)-th group representative. \(\lhd\)

1.9 (Adapted from [VLMS]) Clustering a collection of vectors into \(k = 2\) groups is called \(2\)-way partitioning, since we are partitioning the vectors into \(2\) groups, with index sets \(G_1\) and \(G_2\). Suppose we run \(k\)-means with \(k = 2\) on the \(n\)-vectors \(\mathbf{x}_1,\ldots,\mathbf{x}_N\). Show that there is a nonzero vector \(\mathbf{w}\) and a scalar \(v\) that satisfy

\[ \mathbf{w}^T \mathbf{x}_i + v \geq 0, \forall i \in G_1, \quad \mathbf{w}^T \mathbf{x}_i + v \leq 0, \forall i \in G_2. \]

In other words, the affine function \(f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + v\) is greater than or equal to zero on the first group, and less than or equal to zero on the second group. This is called linear separation of the two groups. [Hint: Consider the function \(\|\mathbf{x} - \mathbf{z}_1\|_2^2 - \|\mathbf{x} - \mathbf{z}_2\|_2^2\), where \(\mathbf{z}_1\) and \(\mathbf{z}_2\) are the group representatives.] \(\lhd\)

1.10 (Adapted from [ASV]) Let \(0 < p \leq 1\). A random variable \(X\) has the geometric distribution with success parameter \(p\) if the possible values of \(X\) are \(\{1, 2, 3, \ldots\}\) and \(X\) satisfies \(\mathbb{P}(X = k) = (1 - p)^{k - 1}p\) for positive integers \(k\). Its mean is \(1/p\) and its variance is \((1 - p)/p^2\).

Let \(X\) be a geometric random variable with parameter \(p = 1/6\).

a) Use Markov’s Inequality to find an upper bound for \(\mathbb{P}(X \geq 16)\).

b) Use Chebyshev’s Inequality to find an upper bound for \(\mathbb{P}(X \geq 16)\).

c) Use the sum of a geometric series to explicitly compute the probability \(\mathbb{P}(X \geq 16)\) and compare with the upper bounds you derived. \(\lhd\)
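
Once you have done parts (a) through (c) by hand, the three numbers can be compared with a few lines of code. The sketch below computes the exact tail by summing the probability mass function rather than by the geometric-series argument the exercise asks for.

```python
import numpy as np

p = 1 / 6
mean, var = 1 / p, (1 - p) / p ** 2

markov = mean / 16                    # Markov applied to P(X >= 16)
chebyshev = var / (16 - mean) ** 2    # Chebyshev, since {X >= 16} is contained in {|X - E[X]| >= 10}
exact = 1 - np.sum((1 - p) ** np.arange(15) * p)   # 1 - P(X <= 15)
print(markov, chebyshev, exact)
```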

1.11 (Adapted from [ASV]) Let \(0 < \lambda < +\infty\). A random variable \(X\) has the exponential distribution with parameter \(\lambda\) if \(X\) has density function \(f(x) = \lambda e^{-\lambda x}\), for \(x \geq 0\) and \(0\) otherwise. Its mean is \(1/\lambda\) and its variance is \(1/\lambda^2\).

Let \(X\) be an exponential random variable with parameter \(\lambda = 1/2\).

a) Use Markov’s Inequality to find an upper bound for \(\mathbb{P}(X > 6)\).

b) Use Chebyshev’s Inequality to find an upper bound for \(\mathbb{P}(X > 6)\).

c) Use an integral to explicitly compute the probability \(\mathbb{P}(X > 6)\) and compare with the upper bounds you derived. \(\lhd\)

1.12 (Adapted from [ASV]) Suppose that \(X\) is a nonnegative random variable with \(\mathbb{E}[X] = 10\).

a) Use Markov’s Inequality to give an upper bound on the probability that \(X\) is larger than \(15\).

b) Suppose that we also know that \(\mathrm{Var}(X) = 3\). Use Chebyshev’s Inequality to give a better upper bound on \(\mathbb{P}(X > 15)\) than in part (a).

c) Suppose that \(Y_1,Y_2,\ldots,Y_{300}\) are i.i.d. random variables with the same distribution as \(X\), so that, in particular, \(\mathbb{E}(Y_i) = 10\) and \(\mathrm{Var}(Y_i) = 3\). Use Chebyshev’s Inequality to give an upper bound on the probability that \(\sum_{i=1}^{300} Y_i\) is larger than \(3060\). \(\lhd\)

1.13 (Adapted from [ASV]) Suppose that we have i.i.d. random variables \(X_1, X_2, X_3,\ldots\) with finite mean \(\mathbb{E}[X_1] = \mu\) and variance \(\mathrm{Var}(X_1) = \sigma^2\). Let \(S_n = X_1 +\cdots+X_n\). Prove that for any fixed \(\varepsilon > 0\) and \(1/2 < \alpha < 1\) we have

\[ \lim_{n \to +\infty} \mathbb{P}\left[\left|\frac{S_n - n\mu}{n^\alpha}\right| < \varepsilon\right] = 1. \]

\(\lhd\)

1.14 (Adapted from [ASV]) By mimicking the proof of the Weak Law of Large Numbers, prove the following variant. Suppose that we have random variables \(X_1, X_2, \ldots\) each with finite mean \(\mathbb{E}[X_i] = \mu\) and variance \(\mathrm{Var}(X_i) = \sigma^2\). Suppose further that \(\mathrm{Cov}(X_i,X_j) = 0\) whenever \(|i - j| \geq 2\) and that there is a constant \(c > 0\) so that \(|\mathrm{Cov}(X_i, X_{i+1})| < c\) for all \(i\). Let \(S_n = X_1 +\cdots+X_n\). Then for any fixed \(\varepsilon > 0\) we have

\[ \lim_{n\to +\infty} \mathbb{P}\left[\left|\frac{S_n}{n} - \mu\right| < \varepsilon\right] = 1. \]

\(\lhd\)

1.15 (Adapted from [ASV]) By mimicking the proof of Chebyshev’s Inequality, prove the following variant. Let \(X\) be a random variable with a finite mean \(\mu\) and for which \(\mathbb{E}[\exp(s(X - \mu))] < +\infty\) for some \(s > 0\). Then we have for \(c > 0\)

\[ \mathbb{P}(X \geq \mu + c) \leq \frac{\mathbb{E}[\exp(s(X - \mu))]}{e^{s c}}. \]

Justify your answer carefully. \(\lhd\)

1.16 (Adapted from [ASV]) Recall that the cumulative distribution function (CDF) of a real-valued random variable \(Z\) is the function \(F_Z(z) = \mathbb{P}[Z \leq z]\) for all \(z \in \mathbb{R}\). Let \(X_1, X_2, \ldots , X_n\) be independent random variables with the same cumulative distribution function \(F\). Denote the minimum and the maximum by

\[ Z = \min(X_1,X_2,\ldots,X_n) \quad\text{and}\quad W = \max(X_1,X_2,\ldots,X_n). \]

Find the cumulative distribution functions \(F_Z\) and \(F_W\) of \(Z\) and \(W\) respectively. \(\lhd\)

1.17 (Adapted from [ASV]) Suppose that \(X_1, X_2, \ldots\) are i.i.d. random variables with Exponential distribution with parameter \(\lambda = 1\) (see Exercise 1.11 above), and let \(M_n = \max(X_1,\ldots,X_n)\). Show that for any \(x \in \mathbb{R}\) we have

\[ \lim_{n \to +\infty} \mathbb{P}(M_n - \ln n \leq x) = \exp(-e^{-x}). \]

The right hand side is the cumulative distribution function of the Gumbel distribution, an example of an extreme value distribution. [Hint: Use Exercise 1.16 to compute \(\mathbb{P}(M_n \leq \ln n + x)\) explicitly, and then evaluate the limit as \(n \to +\infty\).] \(\lhd\)
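
The convergence in Exercise 1.17 can be spot-checked by simulation. The sketch below compares the empirical distribution of \(M_n - \ln n\) to the Gumbel CDF at a few points; \(n\) and the number of repetitions are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 5000

# each row is an i.i.d. Exponential(1) sample of size n
X = rng.exponential(scale=1.0, size=(reps, n))
M = X.max(axis=1) - np.log(n)            # centered maxima M_n - ln(n)

for x in [-1.0, 0.0, 1.0, 2.0]:
    print(x, np.mean(M <= x), np.exp(-np.exp(-x)))   # empirical CDF vs Gumbel CDF
```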

1.18 (Adapted from [ASV]) Let \(A\) and \(B\) be two disjoint events. Under what condition are they independent? \(\lhd\)

1.19 (Adapted from [ASV]) We choose a number from the set \(\{10, 11, 12, \ldots , 99\}\) uniformly at random.

a) Let \(X\) be the first digit and \(Y\) the second digit of the chosen number. Show that \(X\) and \(Y\) are independent random variables.

b) Let \(X\) be the first digit of the chosen number and \(Z\) the sum of the two digits. Show that \(X\) and \(Z\) are not independent.

\(\lhd\)

1.20 (Adapted from [ASV]) Suppose that \(X\) and \(Y\) have joint probability density function \(f(x,y) = 2e^{-(x+2y)}\) for \(x > 0, y > 0\) and \(0\) elsewhere. Show that \(X\) and \(Y\) are independent random variables and provide their marginal distributions. [Hint: Recall the exponential distribution from Problem 1.11.] \(\lhd\)

1.21 (Adapted from [ASV]) Suppose that \(X,Y\) have joint probability density function

\[ f(x,y) = c\exp\left[-\frac{x^2}{2} - \frac{(x - y)^2}{2}\right], \]

for \((x,y) \in \mathbb{R}^2\), where \(c > 0\) is a constant.

a) Find the value of the constant \(c\). [Hint: The order of integration matters. You can do this without doing complicated integrals. Specifically, recall that for any \(\mu \in \mathbb{R}\) and \(\sigma > 0\), it holds that

\[ \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(z - \mu)^2}{2\sigma^2}\right) \mathrm{d} z= 1. \]

]

b) Find the marginal density functions of \(X\) and \(Y\). [Hint: You can do this without doing complicated integrals. Complete the square.]

c) Determine whether \(X\) and \(Y\) are independent. Justify your answer. \(\lhd\)

1.22 (Adapted from [ASV]) Let \(p(x, y)\) be the joint probability mass function of \((X, Y)\). Assume that there are two functions \(a\) and \(b\) such that \(p(x,y) = a(x)b(y)\) for all possible values \(x\) of \(X\) and \(y\) of \(Y\). Show that \(X\) and \(Y\) are independent. Do not assume that \(a\) and \(b\) are probability mass functions. \(\lhd\)

1.23 (Adapted from [BHK]) a) Let \(X\) and \(Y\) be independent random variables with uniform distribution in \([-1/2 , 1/2]\). Compute \(\mathbb{E}[X]\), \(\mathbb{E}[X^2]\), \(\mathrm{Var}[X^2]\), \(\mathbb{E}[X - Y]\), \(\mathbb{E}[XY]\), and \(\mathbb{E}[(X - Y)^2]\).

b) What is the expected squared Euclidean distance between two points generated at random inside the unit \(d\)-dimensional cube \(\mathcal{C} = [-1/2,1/2]^d\)? \(\lhd\)

1.24 (Adapted from [BHK]) a) For an arbitrary \(a > 0\) give a probability distribution for a nonnegative random variable \(X\) such that

\[ \mathbb{P}[X \geq a] = \frac{\mathbb{E}[X]}{a}. \]

b) Show that for any \(c > 0\) there exists a distribution for which Chebyshev’s Inequality is tight, that is,

\[ \mathbb{P}[|X - \mathbb{E}[X]| \geq c] = \frac{\mathrm{Var}[X]}{c^2}. \]

[Hint: Choose a distribution symmetric about \(0\).] \(\lhd\)

1.25 (Adapted from [BHK]) Let \(X_1, X_2, \ldots , X_n\) be i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\). Let

\[ \overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i, \]

be the sample mean. Suppose one estimates the variance using the sample mean as follows

\[ S^2_n = \frac{1}{n} \sum_{i=1}^n \left(X_i - \overline{X}_n\right)^2. \]

Calculate \(\mathbb{E}(S^2_n)\). [Hint: Replace \(X_i - \overline{X}_n\) with \((X_i - \mu) - (\overline{X}_n - \mu)\).] \(\lhd\)
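
A simulation makes the answer easy to anticipate: with \(\sigma^2 = 1\) and a small \(n\), the average of \(S_n^2\) over many repetitions should match the formula you derive. A possible sketch (the distribution and the choice \(n = 5\) are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

X = rng.standard_normal(size=(reps, n))          # mean 0, variance 1
Xbar = X.mean(axis=1, keepdims=True)             # sample means, one per row
S2 = np.mean((X - Xbar) ** 2, axis=1)            # note the division by n, not n - 1

print(S2.mean())    # compare with your formula for E[S_n^2] when sigma^2 = 1
```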

1.26 Let \(f\) and \(g\) have derivatives at \(x\) and let \(\alpha\) and \(\beta\) be constants. Prove from the definition of the derivative that

\[ [\alpha f(x) + \beta g(x)]' = \alpha f'(x) + \beta g'(x). \]

\(\lhd\)

1.27 Let \(Z_1 \sim \mathrm{N}(\mu_1, \sigma_1^2)\) and \(Z_2 \sim \mathrm{N}(\mu_2, \sigma_2^2)\) be independent Normal variables with means \(\mu_1, \mu_2\) and variances \(\sigma_1^2, \sigma_2^2\), respectively. Recall that \(Z_1 + Z_2\) is still Normal.

a) What are the mean and variance of \(Z_1 + Z_2\)?

b) Let \(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Y}_1\) be independent spherical \(d\)-dimensional Gaussians with mean \(-w \mathbf{e}_1\) and variance \(1\). Let \(\mathbf{Y}_2\) be an independent spherical \(d\)-dimensional Gaussian with mean \(w \mathbf{e}_1\) and variance \(1\). Compute \(\mathbb{E}[\|\mathbf{X}_1 - \mathbf{X}_2\|^2]\) and \(\mathbb{E}[\|\mathbf{Y}_1 - \mathbf{Y}_2\|^2]\) explicitly. \(\lhd\)
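
Part (b) can be checked by simulation before computing the expectations exactly; in the sketch below the values of \(d\), \(w\), and the number of samples are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, w, reps = 50, 3.0, 20_000

e1 = np.zeros(d)
e1[0] = 1.0
X1 = rng.standard_normal((reps, d)) - w * e1     # spherical Gaussian, mean -w e_1
X2 = rng.standard_normal((reps, d)) - w * e1
Y1 = rng.standard_normal((reps, d)) - w * e1
Y2 = rng.standard_normal((reps, d)) + w * e1     # spherical Gaussian, mean +w e_1

print(np.mean(np.sum((X1 - X2) ** 2, axis=1)))   # compare to E[||X_1 - X_2||^2]
print(np.mean(np.sum((Y1 - Y2) ** 2, axis=1)))   # compare to E[||Y_1 - Y_2||^2]
```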

1.28 Suppose we are given \(n\) vectors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\) in \(\mathbb{R}^d\) and a partition \(C_1, \ldots, C_k \subseteq [n]\). Let \(n_i = |C_i|\) be the size of cluster \(C_i\) and let

\[ \boldsymbol{\mu}_i^* = \frac{1}{n_i} \sum_{j\in C_i} \mathbf{x}_j \]

be the centroid of \(C_i\), for \(i=1,\ldots,k\).

a) Show that

\[ \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i^*\|^2 = \left(\sum_{j \in C_i} \|\mathbf{x}_j\|^2\right) - n_i\|\boldsymbol{\mu}_i^*\|^2. \]

b) Show that

\[\begin{split} \|\boldsymbol{\mu}_i^*\|^2 = \frac{1}{n_i^2}\left(\sum_{j \in C_i} \|\mathbf{x}_j\|^2 + \sum_{\substack{j, \ell \in C_i\\j \neq \ell}} \mathbf{x}_j^T\mathbf{x}_\ell\right). \end{split}\]

c) Show that

\[\begin{split} \sum_{\substack{j, \ell \in C_i\\j \neq \ell}} \|\mathbf{x}_j - \mathbf{x}_\ell\|^2 = 2(n_i-1)\sum_{j \in C_i} \|\mathbf{x}_j\|^2 - 2 \sum_{\substack{j, \ell \in C_i\\j \neq \ell}} \mathbf{x}_j^T\mathbf{x}_\ell. \end{split}\]

d) Combine (a), (b), (c) to prove that minimizing the \(k\)-means objective \(\mathcal{G}(C_1,\ldots,C_k)\) over all partitions \(C_1, \ldots, C_k\) of \([n]\) is equivalent to minimizing instead

\[ \sum_{i=1}^k \frac{1}{2 |C_i|} \sum_{j,\ell \in C_i} \|\mathbf{x}_j - \mathbf{x}_\ell\|^2. \]

\(\lhd\)
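
The identity behind part (d) is straightforward to confirm numerically on random data before working through the algebra. A possible check (the data and partition below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(12, 3))
clusters = [np.array([0, 1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9, 10, 11])]

lhs = 0.0   # k-means objective with centroid representatives
rhs = 0.0   # pairwise-distance form from part (d)
for C in clusters:
    P = X[C]
    mu = P.mean(axis=0)
    lhs += np.sum((P - mu) ** 2)
    pairwise = np.sum((P[:, None, :] - P[None, :, :]) ** 2)   # sum over all pairs j, l
    rhs += pairwise / (2 * len(C))

print(lhs, rhs)     # the two quantities should agree up to floating-point error
```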

1.29 Suppose the rows of \(A \in \mathbb{R}^{n \times m}\) are given by the transposes of \(\mathbf{r}_1,\ldots,\mathbf{r}_n \in \mathbb{R}^m\) and the columns of \(A\) are given by \(\mathbf{c}_1,\ldots,\mathbf{c}_m \in \mathbb{R}^n\). Give expressions for the elements of \(A^T A\) and \(A A^T\) in terms of these vectors. \(\lhd\)

1.30 Give a proof of the Positive Semidefiniteness of the Covariance using the matrix form of \(\bSigma_\bX\). \(\lhd\)

1.31 Use the Cauchy-Schwarz Inequality to prove that the correlation coefficient lies in \([-1,1]\). \(\lhd\)

1.32 Let \(f(x,y) = g(x) + h(y)\) where \(x, y \in \mathbb{R}\) and \(g,h\) are real-valued continuously differentiable functions. Compute the gradient of \(f\) in terms of the derivatives of \(g\) and \(h\). \(\lhd\)

1.33 Let \(A = [a]\) be a \(1\times 1\) positive definite matrix. Show that \(a > 0\). \(\lhd\)

1.34 Show that the diagonal elements of a positive definite matrix are necessarily positive. \(\lhd\)

1.35 Let \(A, B \in \mathbb{R}^{n \times m}\) and let \(c \in \mathbb{R}\). Show that

a) \((A + B)^T = A^T + B^T\)

b) \((c A)^T = c A^T\)

\(\lhd\)