Background: quick refresher of matrix algebra, differential calculus, and elementary probability

\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\bX}{\mathbf{X}}\)

1.2. Background: quick refresher of matrix algebra, differential calculus, and elementary probability#

We first review a few basic mathematical concepts. In this chapter, we focus on vector and matrix algebra, some basic calculus and optimization, as well as elementary probability concepts. Further mathematical concepts will be reviewed in the next chapters. Along the way, we also introduce Python, especially Numpy.

1.2.1. Vectors and matrices#

Vectors and norms For a vector

\[\begin{split} \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d \end{split}\]

the Euclidean norm of \(\mathbf{x}\) is defined as

\[ \|\mathbf{x}\|_2 = \sqrt{ \sum_{i=1}^d x_i^2 } = \sqrt{\langle \mathbf{x}, \mathbf{x}\rangle} \]

where

\[ \langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^d u_i v_i \]

is the inner product of \(\mathbf{u}\) and \(\mathbf{v}\). This is also known as the \(\ell^2\)-norm.

The inner product has the following useful properties (Check them!). For one, it is symmetric in the sense that

\[ \langle \mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{y}, \mathbf{x} \rangle \qquad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n. \]

Second, it is linear in each input: for any \(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3 \in \mathbb{R}^n\) and \(\beta \in \mathbb{R}\), it holds that

\[ \langle \beta \,\mathbf{x}_1 + \mathbf{x}_2, \mathbf{x}_3 \rangle = \beta \,\langle \mathbf{x}_1,\mathbf{x}_3\rangle + \langle \mathbf{x}_2,\mathbf{x}_3\rangle. \]

Repeated application of the latter property implies, for instance, that for any \(\mathbf{x}_1, \ldots, \mathbf{x}_m, \mathbf{y}_1, \ldots, \mathbf{y}_\ell \in \mathbb{R}^n\),

\[ \left\langle \sum_{i=1}^m \mathbf{x}_i, \sum_{j=1}^\ell \mathbf{y}_j \right\rangle = \sum_{i=1}^m \sum_{j=1}^\ell \langle \mathbf{x}_i,\mathbf{y}_j\rangle. \]

The triangle inequality for the \(\ell^2\)-norm follows from the Cauchy–Schwarz inequality, which is useful in proving many facts.

THEOREM (Cauchy–Schwarz) For all \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^d\)

\[ |\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\|_2 \|\mathbf{v}\|_2. \]

\(\sharp\)
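To get a feel for the inequality, here is a quick numerical check on a pair of arbitrary vectors (a minimal sketch using Numpy, which is introduced in the Numerical Corner below; the specific vectors are just for illustration):

import numpy as np
u = np.array([1., 2., 3.])
v = np.array([4., -5., 6.])
lhs = np.abs(np.sum(u * v))                              # |<u, v>|
rhs = np.sqrt(np.sum(u ** 2)) * np.sqrt(np.sum(v ** 2))  # ||u||_2 ||v||_2
print(lhs <= rhs)                                        # prints True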

Given a collection of vectors \(\mathbf{u}_1,\ldots,\mathbf{u}_k \in \mathbb{R}^d\) and real numbers \(\alpha_1,\ldots,\alpha_k \in \mathbb{R}\), the linear combination of \(\mathbf{u}_\ell\)’s with coefficients \(\alpha_\ell\)’s is the vector

\[ \mathbf{z} = \sum_{\ell=1}^k \alpha_\ell \mathbf{u}_\ell, \]

whose entries are

\[ z_i = \sum_{\ell=1}^k \alpha_\ell (\mathbf{u}_\ell)_i, \quad i=1,\ldots,d. \]

It will be convenient to introduce special notation for common vectors. The dimension of these vectors will often be clear from the context.

  • The all-\(0\) vector will be denoted by \(\mathbf{0}\).

  • The all-\(1\) vector will be denoted by \(\mathbf{1}\).

  • The standard basis (we will review the notion of basis in the next chapter) will be denoted by \(\mathbf{e}_i\), \(i=1,\ldots,d\), where

\[\begin{split} (\mathbf{e}_i)_j = \begin{cases} 1, & \text{if $j = i$,}\\ 0, & \text{o.w.} \end{cases} \end{split}\]

The Euclidean distance between two vectors \(\mathbf{u}\) and \(\mathbf{v}\) in \(\mathbb{R}^d\) is the \(2\)-norm of their difference

\[ d(\mathbf{u},\mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2. \]

Throughout we use the notation \(\|\mathbf{x}\| = \|\mathbf{x}\|_2\) to indicate the \(2\)-norm of \(\mathbf{x}\) unless specified otherwise.

More generally, for \(p \geq 1\), the \(\ell^p\)-norm of \(\mathbf{x}\) is given by

\[ \|\mathbf{x}\|_p = \left( \sum_{i=1}^d |x_i|^p \right)^{1/p}. \]

It is instructive to visualize the unit ball, that is, the set \(\{\mathbf{x}:\|\mathbf{x}\|_p \leq 1\}\), under varying \(p\).

Finally, the \(\ell^\infty\)-norm is defined as

\[ \|\mathbf{x}\|_\infty = \max_{i=1,\ldots,d}|x_i|. \]

There exist other norms. Formally:

DEFINITION (Norm): A norm is a function \(\ell\) from \(\mathbb{R}^d\) to \(\mathbb{R}_+\) that satisfies for all \(a \in \mathbb{R}\), \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^d\)

  • (Absolute homogeneity): \(\ell(a \mathbf{u}) = |a| \ell(\mathbf{u})\)

  • (Triangle inequality): \(\ell(\mathbf{u}+\mathbf{v}) \leq \ell(\mathbf{u}) + \ell(\mathbf{v})\)

  • (Point-separating): \(\ell(\mathbf{u}) = 0\) implies \(\mathbf{u} =0\).

\(\natural\)

NUMERICAL CORNER: In Numpy, a vector is defined as a 1d array. We first must import the Numpy package, which is often abbreviated by np.

import numpy as np
u = np.array([1., 3., 5. ,7.])
print(u)
[1. 3. 5. 7.]

To obtain the norm of a vector, we can use the function linalg.norm, which requires the numpy.linalg package:

from numpy import linalg as LA
LA.norm(u)
9.16515138991168

which we check next “by hand”

np.sqrt(np.sum(u ** 2))
9.16515138991168

In Numpy, ** indicates element-wise exponentiation.
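The same function can also compute the other norms above through its second argument (the ord parameter of numpy.linalg.norm); a quick sketch:

print(LA.norm(u, 1))       # l^1 norm: 1 + 3 + 5 + 7 = 16
print(LA.norm(u, np.inf))  # l^infinity norm: largest absolute entry, here 7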

TRY IT! Compute the inner product of \(u = (1,2,3,4)\) and \(v = (5, 4, 3, 2)\) without using the function np.dot. Hint: The product of two real numbers \(a\) and \(b\) is a * b. (Open in Colab)

u = np.array([1., 2., 3. ,4.])
# EDIT THIS LINE: define v
# EDIT THIS LINE: compute the inner product between u and v

\(\unlhd\)

Matrices For an \(n \times m\) matrix \(A \in \mathbb{R}^{n \times m}\) with real entries, we denote by \(A_{ij}\) its entry in row \(i\) and column \(j\). We also refer to a matrix as the collection of all of its entries as follows

\[ A = (A_{ij})_{i\in [n],j \in [m]}. \]

We occasionally simplify the notation to \((A_{ij})_{i,j}\) when the range of the indices is clear from context. We use the notation

\[ A_{i,\cdot} = (A_{i1} \cdots A_{im}), \]

to indicate the \(i\)-th row of \(A\) – as a row vector, i.e., a matrix with a single row – and similarly

\[\begin{split} A_{\cdot,j} = \begin{pmatrix} A_{1j}\\ \vdots\\ A_{nj} \end{pmatrix}, \end{split}\]

the \(j\)-th column of \(A\) - as a column vector, i.e., a matrix with a single column.

EXAMPLE: Suppose

\[\begin{split} A = \begin{bmatrix} 2 & 5\\ 3 & 6\\ 1 & 1 \end{bmatrix}. \end{split}\]

Then the second row is

\[ A_{2,\cdot} = \begin{bmatrix} 3 & 6 \end{bmatrix}, \]

and the second column is

\[\begin{split} A_{\cdot,2} = \begin{bmatrix} 5\\ 6\\ 1 \end{bmatrix}. \end{split}\]

\(\lhd\)

Matrices can be multiplied by scalars: let \(A = (A_{ij})_{i\in [n],j \in [m]} \in \mathbb{R}^{n\times m}\) and \(\gamma \in \mathbb{R}\); then \(\gamma A = (\gamma A_{ij})_{i\in [n],j \in [m]}\) is the matrix whose entries are those of \(A\) multiplied by \(\gamma\). Matrices can also be added to each other – provided they have the same size: let \(A = (A_{ij})_{i\in [n],j \in [m]} \in \mathbb{R}^{n\times m}\) and \(B = (B_{ij})_{i\in [n],j \in [m]} \in \mathbb{R}^{n\times m}\); then \(C = A + B\) is the matrix \(C = (C_{ij})_{i\in [n],j \in [m]}\) where \(C_{ij} = A_{ij} + B_{ij}\) for all \(i,j\).

Recall that the transpose \(A^T\) of a matrix \(A \in \mathbb{R}^{n\times m}\) is defined as the matrix in \(\mathbb{R}^{m\times n}\) that switches the row and column indices of \(A\), that is, its entries are

\[ [A^T]_{ij} = A_{ji},\quad i=1,\ldots,m, j=1,\ldots,n. \]

Figure: Illustration of transposition (Source)

transpose

\(\bowtie\)

EXAMPLE: Suppose again

\[\begin{split} A = \begin{bmatrix} 2 & 5\\ 3 & 6\\ 1 & 1 \end{bmatrix}. \end{split}\]

Then its transpose is

\[\begin{split} A^T = \begin{bmatrix} 2 & 3 & 1\\ 5 & 6 & 1 \end{bmatrix}. \end{split}\]

\(\lhd\)

We list some useful properties of the transpose (check them!). For any \(\gamma \in \mathbb{R}\) and \(A, B \in \mathbb{R}^{n \times m}\):

a) \((A^T)^T = A\)

b) \((\gamma A + B)^T = \gamma A^T + B^T\)

DEFINITION (Symmetric Matrix) A square matrix \(B \in \mathbb{R}^{n \times n}\) is symmetric if \(B^T = B\). \(\natural\)

The transpose in particular can be used to turn a column vector into a row vector and vice versa. That is, if \(\mathbf{b} = (b_1, b_2, \ldots,b_n) \in \mathbb{R}^n\) is a column vector

\[\begin{split} \mathbf{b} = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{bmatrix}, \end{split}\]

then \(\mathbf{b}^T\) is the corresponding row vector

\[ \mathbf{b}^T = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}. \]

For instance,

\[ \mathbf{b}^T \mathbf{b} = b_1^2 + \cdots + b_n^2 = \sum_{i=1}^n b_i^2 = \|\mathbf{b}\|_2^2 \]

is the squared Euclidean norm of \(\mathbf{b}\).

NUMERICAL CORNER: We will often work with collections of \(n\) vectors \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) in \(\mathbb{R}^d\) and it will be convenient to stack them up into a matrix

\[\begin{split} X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix}. \end{split}\]

To create a matrix out of two vectors, we use the function numpy.stack.

u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])
X = np.stack((u,v),axis=0)
print(X)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]]

Quoting the documentation:

The axis parameter specifies the index of the new axis in the dimensions of the result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be the last dimension.

The same scheme still works with more than two vectors.

u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])
w = np.array([9., 8., 7., 6.])
X = np.stack((u,v,w))
print(X)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]
 [9. 8. 7. 6.]]

\(\unlhd\)

As for vectors, it will be convenient to introduce special notation for common matrices.

  • The all-\(0\) matrix of dimension \(m \times n\) will be denoted by \(\mathbf{0}_{m \times n}\).

  • The all-\(1\) matrix of dimension \(m \times n\) will be denoted by \(J_{m \times n}\).

  • The identity matrix of dimension \(d \times d\) will be denoted by \(I_{d \times d}\). Specifically, this is the matrix whose \(i\)-th column is the standard basis vector \(\mathbf{e}_i\), \(i=1,\ldots,d\). Put differently, it is the square diagonal matrix with ones on the diagonal.
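In Numpy, these special matrices can be created with the functions numpy.zeros, numpy.ones, and numpy.identity; a quick sketch:

print(np.zeros((2,3)))    # all-0 matrix of dimension 2 x 3
print(np.ones((2,3)))     # all-1 matrix of dimension 2 x 3
print(np.identity(3))     # identity matrix of dimension 3 x 3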

Matrix-vector product Recall that, for a matrix \(A = (A_{ij})_{i\in [n],j \in [m]} \in \mathbb{R}^{n \times m}\) and a column vector \(\mathbf{b} = (b_{i})_{i\in [m]} \in \mathbb{R}^{m}\), the matrix-vector product \(\mathbf{c} = A \mathbf{b}\) is the vector with entries

\[ c_i = (A\mathbf{b})_i = \sum_{j=1}^m A_{ij} b_j. \]

In vector form,

\[ A \mathbf{b} = \sum_{j=1}^m A_{\cdot,j} b_j, \]

that is, \(A \mathbf{b}\) is a linear combination of the columns of \(A\) where the coefficients are the entries of \(\mathbf{b}\). Matrix-vector products are linear in the following sense (Check it!): for any \(\gamma \in \mathbb{R}\) and \(\mathbf{b}_1, \mathbf{b}_2 \in \mathbb{R}^m\)

\[ A(\gamma \mathbf{b}_1 + \mathbf{b}_2) = \gamma A\mathbf{b}_1 + A \mathbf{b}_2. \]

EXAMPLE: (continued) Consider the column vector \(\mathbf{b} = (1, 0)^T\). Then

\[\begin{split} A \mathbf{b} = \begin{bmatrix} 2(1) + 5(0)\\ 3(1) + 6(0)\\ 1(1) + 1(0) \end{bmatrix} = \begin{bmatrix} 2\\ 3\\ 1 \end{bmatrix}, \end{split}\]

which can also be written in vector form as

\[\begin{split} (1) \begin{bmatrix} 2\\ 3\\ 1 \end{bmatrix} + (0) \begin{bmatrix} 5\\ 6\\ 1 \end{bmatrix} = \begin{bmatrix} 2\\ 3\\ 1 \end{bmatrix}. \end{split}\]

\(\lhd\)
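Here is a quick Numpy sketch of this example, using the @ operator for the matrix-vector product and checking the column view (the arrays simply reproduce the example above):

A = np.array([[2., 5.], [3., 6.], [1., 1.]])
b = np.array([1., 0.])
print(A @ b)                           # matrix-vector product: [2. 3. 1.]
print(b[0] * A[:,0] + b[1] * A[:,1])   # same linear combination of the columns of A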

Matrix-matrix product Recall that, for matrices \(A \in \mathbb{R}^{n \times k}\) and \(B \in \mathbb{R}^{k \times m}\), their matrix product is defined as the matrix \(C = AB \in \mathbb{R}^{n \times m}\) whose entries are

\[ C_{i\ell} = (AB)_{i\ell} = \sum_{j=1}^k A_{ij} B_{j\ell}, \qquad \ell=1,\ldots,m. \]

Note that the number of columns of \(A\) and the number of rows of \(B\) must match. There are many different ways to view this formula that are helpful in interpreting matrix-matrix products in different contexts.

First, we observe that the entry \(C_{i\ell}\) is an inner product of the \(i\)-th row of \(A\) and of the \(\ell\)-th column of \(B\). That is,

\[ C_{i\ell} = A_{i,\cdot} B_{\cdot,\ell}. \]

In matrix form,

\[\begin{split} AB = \begin{bmatrix} A_{1,\cdot} B_{\cdot,1} & A_{1,\cdot} B_{\cdot,2} & \cdots & A_{1,\cdot} B_{\cdot,m} \\ A_{2,\cdot} B_{\cdot,1} & A_{2,\cdot} B_{\cdot,2} & \cdots & A_{2,\cdot} B_{\cdot,m} \\ \vdots & \vdots & \ddots & \vdots\\ A_{n,\cdot} B_{\cdot,1} & A_{n,\cdot} B_{\cdot,2} & \cdots & A_{n,\cdot} B_{\cdot,m} \end{bmatrix}. \end{split}\]

Alternatively, the previous display can be re-written as

\[ AB = \begin{bmatrix} A (B_{\cdot,1}) & A (B_{\cdot,2}) & \cdots & A (B_{\cdot,m}) \end{bmatrix}, \]

where we specify a matrix by the collection of its columns.

Put differently, by the matrix-vector product formula above, the \(j\)-th column of the product \(AB\) is a linear combination of the columns of \(A\) where the coefficients are the entries in column \(j\) of \(B\)

\[ (AB)_{\cdot,j} = A B_{\cdot,j} = \sum_{\ell=1}^k A_{\cdot,\ell} B_{\ell j}. \]

Similarly, the \(i\)-th row of the product \(AB\) is a linear combination of the rows of \(B\) where the coefficients are the entries in row \(i\) of \(A\)

\[ (AB)_{i,\cdot} = \sum_{\ell=1}^k A_{i\ell} B_{\ell,\cdot}. \]

Writing \(AB\) as the collection of its rows, this is

\[\begin{split} AB = \begin{bmatrix} A_{1,\cdot} B\\ A_{2,\cdot} B\\ \vdots\\ A_{n,\cdot} B \end{bmatrix}, \end{split}\]

where recall that \(A_{i,\cdot}\) is a row vector.
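As a quick sanity check of the column view (a sketch with arbitrary small matrices):

A = np.array([[2., 5.], [3., 6.], [1., 1.]])
B = np.array([[1., 0., 2.], [0., 1., 3.]])
C = A @ B              # matrix-matrix product, here of dimension 3 x 3
print(C[:,2])          # third column of AB
print(A @ B[:,2])      # A times the third column of B; the two coincide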

EXAMPLE: Recall that if we think of a vector \(\mathbf{b} \in \mathbb{R}^n\) as a column vector, then its transpose \(\mathbf{b}^T\) is a row vector. We previously showed that \(\mathbf{b}^T \mathbf{b} = \sum_{i=1}^n b_i^2\) is a scalar, i.e., a real number. This time, we compute \(\mathbf{b} \mathbf{b}^T\).

Let us first make sure that the dimensions fit. Seeing these vectors as matrices, we have \(\mathbf{b} \in \mathbb{R}^{n\times 1}\) and \(\mathbf{b}^T \in \mathbb{R}^{1\times n}\). So indeed we can multiply them together since the number of columns of the first matrix matches the number of rows of the second one.

What are the dimensions of the final product? Taking the number of rows of the first matrix and the number of columns of the second one, we see that it is \(n \times n\).

Finally we get

\[\begin{split} \mathbf{b} \mathbf{b}^T = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix} = \begin{bmatrix} b_{1} b_{1} & b_{1} b_{2} & \cdots & b_{1} b_{n} \\ b_{2} b_{1} & b_{2} b_{2} & \cdots & b_{2} b_{n} \\ \vdots & \vdots & \ddots & \vdots\\ b_{n} b_{1} & b_{n} b_{2} & \cdots & b_{n} b_{n} \end{bmatrix}. \end{split}\]

That is, \((\mathbf{b} \mathbf{b}^T)_{i,j} = b_i b_j\). \(\lhd\)
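In Numpy, these two products can be computed with numpy.inner and numpy.outer; a quick sketch:

b = np.array([1., 2., 3.])
print(np.inner(b, b))   # b^T b = 1 + 4 + 9 = 14
print(np.outer(b, b))   # b b^T, the 3 x 3 matrix with (i,j)-entry b_i b_j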

We list some useful properties of the matrix-matrix product (check them!). For any \(\gamma \in \mathbb{R}\), \(A, B \in \mathbb{R}^{n \times m}\) and \(C \in \mathbb{R}^{m \times \ell}\):

a) \((\gamma A)B = A (\gamma B) = \gamma A B\)

b) \((A + B)C = AC + BC\)

c) \((BC)^T = C^T B^T\)

MULTIPLE CHOICE: Let \(A \in \mathbb{R}^{n \times m}\), \(B \in \mathbb{R}^{m \times n}\), \(C \in \mathbb{R}^{n \times \ell}\), and \(D \in \mathbb{R}^{\ell \times n}\). Determine the dimensions of the transpose of the matrix:

\[ (A^T + B) C D \]

a) \(m \times m\)

b) \(n \times n\)

c) \(m \times n\)

d) \(n \times m\)

e) The matrix is not well-defined.

\(\ddagger\)

Matrix norms We will also need a notion of matrix norm. A natural way to define a norm for matrices is to notice that an \(n \times m\) matrix \(A\) can be thought of as an \(nm\) vector, with one element for each entry of \(A\). Indeed, addition and scalar multiplication work exactly in the same way. Hence, we can define the \(2\)-norm of a matrix in terms of the sum of its squared entries. (We will encounter other matrix norms later in the course.)

DEFINITION (Frobenius Norm) The Frobenius norm of an \(n \times m\) matrix \(A \in \mathbb{R}^{n \times m}\) is defined as

\[ \|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m A_{ij}^2}. \]

\(\natural\)

Using the row notation, we see that the square of the Frobenius norm can be written as the sum of the squared Euclidean norms of the rows

\[ \|A\|_F^2 = \sum_{i=1}^n \|A_{i,\cdot}\|_2^2. \]

Similarly in terms of the columns \(A_{\cdot,j}\), \(j=1,\ldots,m\), of \(A\) we have \(\|A\|_F^2 = \sum_{j=1}^m \|A_{\cdot,j}\|_2^2\).

For two matrices \(A, B \in \mathbb{R}^{n \times m}\), the Frobenius norm of their difference \(\|A - B\|_F\) can be interpreted as a distance between \(A\) and \(B\), that is, a measure of how dissimilar they are.

It can be shown (try it!) using Cauchy-Schwarz that for any \(A, B\) for which \(AB\) is well-defined it holds that

\[ \|A B \|_F \leq \|A\|_F \|B\|_F. \]

This applies in particular when \(B\) is a column vector, in which case \(\|B\|_F\) is its Euclidean norm.

NUMERICAL CORNER: In Numpy, the Frobenius norm of a matrix can be computed using the function numpy.linalg.norm.

A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)
[[1. 0.]
 [0. 1.]
 [0. 0.]]
LA.norm(A)
1.4142135623730951
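We can also check the submultiplicativity inequality above on this example (a quick sketch; B is an arbitrary matrix of compatible dimensions):

B = np.array([[2., 1.], [1., 3.]])
print(LA.norm(A @ B))             # Frobenius norm of the product
print(LA.norm(A) * LA.norm(B))    # product of the Frobenius norms; it is at least as large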

\(\unlhd\)

Quadratic forms Let \(B \in \mathbb{R}^{n \times n}\) be a square matrix. The associated quadratic form

\[ \langle \mathbf{z}, B \mathbf{z} \rangle = \mathbf{z}^T B \mathbf{z} = \sum_{i=1}^n z_i \sum_{j=1}^n B_{i,j} z_j = \sum_{i=1}^n \sum_{j=1}^n z_i B_{i,j} z_j \]

defined for any \(\mathbf{z} = (z_1,\ldots,z_n)\), will make many appearances throughout.

A form is a homogeneous polynomial \(f(\mathbf{z})\), viewed as a function of \(\mathbf{z}\). By homogeneous, we mean that for any \(\mathbf{z} \in \mathbb{R}^n\) and any scalar \(\alpha \in \mathbb{R}\)

\[ f(\alpha \mathbf{z}) = \alpha^k f(\mathbf{z}), \]

for some integer \(k\) that is called the degree of homogeneity. (Note that this is different from the absolute homogeneity of norms.) When \(k=2\), we refer to a quadratic form. Let us check that \(\langle \mathbf{z}, B \mathbf{z} \rangle\) indeed satisfies these properties. The alternative expression \(\sum_{i=1}^n \sum_{j=1}^n z_i B_{i,j} z_j\) makes it clear that it is a polynomial in the variables \(z_1,\ldots,z_n\). Moreover, for any \(\alpha \in \mathbb{R}\), by using linearity multiple times

\[ \langle \alpha \mathbf{z}, B (\alpha \mathbf{z}) \rangle = \langle \alpha \mathbf{z}, \alpha B \mathbf{z} \rangle = \alpha \langle \mathbf{z}, \alpha B \mathbf{z} \rangle = \alpha^2 \langle \mathbf{z}, B \mathbf{z} \rangle. \]
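As a quick numerical illustration (a sketch with an arbitrary symmetric matrix B), we compute a quadratic form and check the degree-2 homogeneity:

B = np.array([[2., 1.], [1., 3.]])
z = np.array([1., -1.])
print(z @ B @ z)              # quadratic form <z, Bz>, here equal to 3
print((2 * z) @ B @ (2 * z))  # scaling z by 2 multiplies the value by 4, here 12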

In particular, the following property of matrices will play an important role. It is defined in terms of the associated quadratic form.

DEFINITION (Positive Semidefinite Matrix) A symmetric matrix \(B \in \mathbb{R}^{n \times n}\) is positive semidefinite if

\[ \langle \mathbf{z}, B \mathbf{z} \rangle \geq 0, \quad \forall \mathbf{z} \neq \mathbf{0}. \]

We also write \(B \succeq 0\) in that case. If the inequality above is strict, we say that \(B\) is positive definite, in which case we write \(B \succ 0\). \(\natural\)

We will see an important example later in this section.

1.2.2. Differential calculus#

Next, we review some basic concepts from differential calculus. We focus here on definitions and results relevant to optimization theory, which plays a central role in data science.

Limits and continuity Throughout this section, we use the Euclidean norm \(\|\mathbf{x}\| = \sqrt{\sum_{i=1}^d x_i^2}\) for \(\mathbf{x} = (x_1,\ldots, x_d)^T \in \mathbb{R}^d\).

The open \(r\)-ball around \(\mathbf{x} \in \mathbb{R}^d\) is the set of points within Euclidean distance \(r\) of \(\mathbf{x}\), that is,

\[ B_r(\mathbf{x}) = \{\mathbf{y} \in \mathbb{R}^d \,:\, \|\mathbf{y} - \mathbf{x}\| < r\}. \]

A point \(\mathbf{x} \in \mathbb{R}^d\) is an interior point of a set \(A \subseteq \mathbb{R}^d\) if there exists an \(r > 0\) such that \(B_r(\mathbf{x}) \subseteq A\). A set \(A\) is open if it consists entirely of interior points.

Figure: Illustration of an open ball of radius \(\varepsilon\) around point \(x\). (Source)

open ball

\(\bowtie\)

A point \(\mathbf{x} \in \mathbb{R}^d\) is a limit point of a set \(A \subseteq \mathbb{R}^d\) if every open ball around \(\mathbf{x}\) contains an element \(\mathbf{a}\) of \(A\) such that \(\mathbf{a} \neq \mathbf{x}\).

A set \(A\) is closed if every limit point of \(A\) belongs to \(A\). Or, put differently, a set is closed if its complement is open.

A set \(A \subseteq \mathbb{R}^d\) is bounded if there exists an \(r > 0\) such that \(A \subseteq B_r(\mathbf{0})\), where \(\mathbf{0} = (0,\ldots,0)^T\).

DEFINITION (Limits of a Function) Let \(f: D \to \mathbb{R}\) be a real-valued function on \(D \subseteq \mathbb{R}^d\). Then \(f\) is said to have a limit \(L \in \mathbb{R}\) as \(\mathbf{x}\) approaches \(\mathbf{a}\) if: for any \(\varepsilon > 0\), there exists a \(\delta > 0\) such that \(|f(\mathbf{x}) - L| < \varepsilon\) for all \(\mathbf{x} \in D \cap B_\delta(\mathbf{a})\setminus \{\mathbf{a}\}\). This is written as

\[ \lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = L. \]

\(\natural\)

Note that we explicitly exclude \(\mathbf{a}\) itself from having to satisfy the condition \(|f(\mathbf{x}) - L| < \varepsilon\). In particular, we may have \(f(\mathbf{a}) \neq L\). We also do not restrict \(\mathbf{a}\) to be in \(D\).

DEFINITION (Continuous Function) Let \(f: D \to \mathbb{R}\) be a real-valued function on \(D \subseteq \mathbb{R}^d\). Then \(f\) is said to be continuous at \(\mathbf{a} \in D\) if

\[ \lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = f(\mathbf{a}). \]

\(\natural\)

Figure: A continuous function. E.g., in a small neighborhood around \(2\), the function varies only slightly. (Source)

A continuous function

\(\bowtie\)

We will not prove the following fundamental analysis result, which will be used repeatedly in this course. (See e.g. Wikipedia for a sketch of the proof.) Suppose \(f : D \to \mathbb{R}\) is defined on a set \(D \subseteq \mathbb{R}^d\). We say that \(f\) attains a maximum value \(M\) at \(\mathbf{z}^*\) if \(f(\mathbf{z}^*) = M\) and \(M \geq f(\mathbf{x})\) for all \(\mathbf{x} \in D\). Similarly, we say \(f\) attains a minimum value \(m\) at \(\mathbf{z}_*\) if \(f(\mathbf{z}_*) = m\) and \(m \leq f(\mathbf{x})\) for all \(\mathbf{x} \in D\).

THEOREM (Extreme Value) Let \(f : D \to \mathbb{R}\) be a real-valued, continuous function on a nonempty, closed, bounded set \(D\subseteq \mathbb{R}^d\). Then \(f\) attains a maximum and a minimum on \(D\). \(\sharp\)

Derivatives We move on to derivatives. Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable. Formally:

DEFINITION (Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}\) and let \(x_0 \in D\) be an interior point of \(D\). The derivative of \(f\) at \(x_0\) is

\[ f'(x_0) = \frac{\mathrm{d} f (x_0)}{\mathrm{d} x} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} \]

provided the limit exists. \(\natural\)

Figure: The derivative at the red point is the slope of the line tangent the curve there. (Source)

The derivative

\(\bowtie\)
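Numerically, the limit in the definition can be approximated by taking a small increment h; a rough sketch for f(x) = x² at x₀ = 1, where the exact derivative is 2:

f = lambda x: x ** 2
x0, h = 1., 1e-6
print((f(x0 + h) - f(x0)) / h)   # difference quotient, approximately 2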

The following lemma encapsulates a key insight about the derivative of \(f\) at \(x_0\): it tells us where to find smaller values.

LEMMA (Descent Direction) Let \(f : D \to \mathbb{R}\) with \(D \subseteq \mathbb{R}\) and let \(x_0 \in D\) be an interior point of \(D\) where \(f'(x_0)\) exists. If \(f'(x_0) > 0\), then there is an open ball \(B_\delta(x_0) \subseteq D\) around \(x_0\) such that for each \(x\) in \(B_\delta(x_0)\):

(a) \(f(x) > f(x_0)\) if \(x > x_0\),

(b) \(f(x) < f(x_0)\) if \(x < x_0\).

If instead \(f'(x_0) < 0\), the opposite holds. \(\flat\)

Proof idea: Follows from the definition of the derivative by taking \(\varepsilon\) small enough that \(f'(x_0) - \varepsilon > 0\).

Proof: Take \(\varepsilon = f'(x_0)/2\). By definition of the derivative, there is \(\delta > 0\) such that

\[ f'(x_0) - \frac{f(x_0 + h) - f(x_0)}{h} < \varepsilon \]

for all \(0 < h < \delta\). Rearranging gives

\[ f(x_0 + h) > f(x_0) + [f'(x_0) - \varepsilon] h > f(x_0) \]

by our choice of \(\varepsilon\). The other direction is similar. \(\square\)

One implication of the Descent Direction Lemma is the Mean Value Theorem, which will lead us later to Taylor’s Theorem. First, an important special case:

THEOREM (Rolle) Let \(f : [a,b] \to \mathbb{R}\) be a continuous function and assume that its derivative exists on \((a,b)\). If \(f(a) = f(b)\) then there is \(a < c < b\) such that \(f'(c) = 0\). \(\sharp\)

Proof idea: Look at an extremum and use the Descent Direction Lemma to get a contradiction.

Figure: Illustration of Rolle’s theorem. (Source)

Illustration of Rolle's theorem

\(\bowtie\)

Proof: If \(f(x) = f(a)\) for all \(x \in (a, b)\), then \(f'(x) = 0\) on \((a, b)\) and we are done. So assume there is \(y \in (a, b)\) such that \(f(y) \neq f(a)\). Assume without loss of generality that \(f(y) > f(a)\) (otherwise consider the function \(-f\)). By the Extreme Value Theorem, \(f\) attains a maximum value at some \(c \in [a,b]\). By our assumption, \(a\) and \(b\) cannot be the location of the maximum and it must be that \(c \in (a, b)\).

We claim that \(f'(c) = 0\). We argue by contradiction. Suppose \(f'(c) > 0\). By the Descent Direction Lemma, there is a \(\delta > 0\) such that \(f(x) > f(c)\) for all \(x \in B_\delta(c)\) with \(x > c\), contradicting the fact that \(f\) attains its maximum at \(c\). A similar argument holds if \(f'(c) < 0\). That concludes the proof. \(\square\)

THEOREM (Mean Value) Let \(f : [a,b] \to \mathbb{R}\) be a continuous function and assume that its derivative exists on \((a,b)\). Then there is \(a < c < b\) such that

\[ f(b) = f(a) + (b-a)f'(c), \]

or put differently

\[ \frac{f(b) - f(a)}{b-a} = f'(c). \]

\(\sharp\)

Proof idea: Apply Rolle to

\[ \phi(x) = f(x) - \left[f(a) + \frac{f(b) - f(a)}{b - a} (x-a)\right]. \]

Figure: Illustration of the mean value theorem. (Source)

Illustration of the mean value theorem. (Source)

\(\bowtie\)

Proof: Let \(\phi(x) = f(x) - f(a) - \frac{f(b) - f(a)}{b - a} (x-a)\). Note that \(\phi(a) = \phi(b) = 0\) and \(\phi'(x) = f'(x) - \frac{f(b) - f(a)}{b - a}\) for all \(x \in (a, b)\). Thus, by Rolle, there is \(c \in (a, b)\) such that \(\phi'(c) = 0\). That implies \(f'(c) = \frac{f(b) - f(a)}{b - a}\), which gives the result. \(\square\)

We will also use Taylor’s Theorem, a generalization of the Mean Value Theorem that provides a polynomial approximation to a function around a point. We will restrict ourselves to the case of a linear approximation with second-order error term, which will suffice for our purposes.

THEOREM (Taylor) Let \(f: D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}\). Suppose \(f\) has a continuous derivative on \([a,b]\) and that its second derivative exists on \((a,b)\). Then for any \(x \in [a, b]\)

\[ f(x) = f(a) + (x-a) f'(a) + \frac{1}{2} (x-a)^2 f''(\xi) \]

for some \(a < \xi < x\). \(\sharp\)

Proof idea: The Mean Value Theorem implies that there is \(a < \xi< x\) such that

\[ f(x) = f(a) + (x - a)f'(\xi). \]

One way to think of the proof of that result is the following: we constructed an affine function that agrees with \(f\) at \(a\) and \(x\), then used Rolle to express the coefficient of the linear term using \(f'\). Here we do the same with a polynomial of degree \(2\). But we now have an extra degree of freedom in choosing this polynomial. Because we are looking for a good approximation close to \(a\), we choose to make the first derivative at \(a\) also agree. Applying Rolle twice gives the claim.

Proof: Let

\[ P(t) = \alpha_0 + \alpha_1 (t-a) + \alpha_2 (t-a)^2. \]

We choose the \(\alpha_i\)’s so that \(P(a) = f(a)\), \(P'(a) = f'(a)\), and \(P(x) = f(x)\). The first two lead to the conditions

\[ \alpha_0 = f(a), \quad \alpha_1 = f'(a). \]

Let \(\phi(t) = f(t) - P(t)\). By construction \(\phi(a) = \phi(x) = 0\). By Rolle, there is a \(\xi' \in (a, x)\) such that \(\phi'(\xi') = 0\). Moreover, \(\phi'(a) = 0\). Hence we can apply Rolle again - this time to \(\phi'\) on \([a, \xi']\). It implies that there is \(\xi \in (a, \xi')\) such that \(\phi''(\xi) = 0\).

The second derivative of \(\phi\) at \(\xi\) is

\[ 0 = \phi''(\xi) = f''(\xi) - P''(\xi) = f''(\xi) - 2 \alpha_2 \]

so \(\alpha_2 = f''(\xi)/2\). Plugging into \(P\) and using \(\phi(x) = 0\) gives the claim. \(\square\)
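As a quick numerical illustration of Taylor's Theorem (a sketch for f(x) = eˣ around a = 0, where f'' is increasing so that f''(ξ) is at most eˣ on [a, x]):

a, x = 0., 0.1
approx = np.exp(a) + (x - a) * np.exp(a)     # linear approximation of e^x at a = 0
error = np.abs(np.exp(x) - approx)           # actual error of the approximation
bound = 0.5 * (x - a) ** 2 * np.exp(x)       # (1/2)(x - a)^2 times an upper bound on f''(xi)
print(error <= bound)                        # prints True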

Optimization As we mentioned before, optimization problems play a ubiquitous role in data science. Here we look at unconstrained optimization problems, that is, problems of the form:

\[ \min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}) \]

where \(f : \mathbb{R}^d \to \mathbb{R}\).

Ideally, we would like to find a global minimizer to the optimization problem above.

DEFINITION (Global Minimizer) Let \(f : \mathbb{R}^d \to \mathbb{R}\). The point \(\mathbf{x}^* \in \mathbb{R}^d\) is a global minimizer of \(f\) over \(\mathbb{R}^d\) if

\[ f(\mathbf{x}) \geq f(\mathbf{x}^*), \quad \forall \mathbf{x} \in \mathbb{R}^d. \]

\(\natural\)

NUMERICAL CORNER: The function \(f(x) = x^2\) over \(\mathbb{R}\) has a global minimizer at \(x^* = 0\). Indeed, we clearly have \(f(x) \geq 0\) for all \(x\) while \(f(0) = 0\). To plot the function, we use the matplotlib package again, and specifically its function matplotlib.pyplot.plot. We also use the function numpy.linspace to create an array of evenly spaced numbers where we evaluate \(f\).

import matplotlib.pyplot as plt
x = np.linspace(-2,2,100)
y = x ** 2
plt.plot(x,y)
plt.show()

The function \(f(x) = e^x\) over \(\mathbb{R}\) does not have a global minimizer. Indeed, \(f(x) > 0\) but no \(x\) achieves \(0\). And, for any \(m > 0\), there is \(x\) small enough such that \(f(x) < m\). Note that \(\mathbb{R}\) is not bounded, therefore the Extreme Value Theorem does not apply here.

x = np.linspace(-2,2,100)
y = np.exp(x)
plt.plot(x,y)
plt.ylim(0,5)
plt.show()

The function \(f(x) = (x+1)^2 (x-1)^2\) over \(\mathbb{R}\) has two global minimizers at \(x^* = -1\) and \(x^{**} = 1\). Indeed, \(f(x) \geq 0\) and \(f(x) = 0\) if and only if \(x = x^*\) or \(x = x^{**}\).

x = np.linspace(-2,2,100)
y = ((x+1)**2) * ((x-1)**2)
plt.plot(x,y)
plt.ylim(0,5)
plt.show()

\(\unlhd\)

In general, finding a global minimizer and certifying that one has been found can be difficult unless some special structure is present. Therefore weaker notions of solution have been introduced.

DEFINITION (Local Minimizer) Let \(f : \mathbb{R}^d \to \mathbb{R}\). The point \(\mathbf{x}^* \in \mathbb{R}^d\) is a local minimizer of \(f\) over \(\mathbb{R}^d\) if there is \(\delta > 0\) such that

\[ f(\mathbf{x}) \geq f(\mathbf{x}^*), \quad \forall \mathbf{x} \in B_{\delta}(\mathbf{x}^*) \setminus \{\mathbf{x}^*\}. \]

If the inequality is strict, we say that \(\mathbf{x}^*\) is a strict local minimizer. \(\natural\)

In words, \(\mathbf{x}^*\) is a local minimizer if there is an open ball around \(\mathbf{x}^*\) where it attains the minimum value. The difference between global and local minimizers is illustrated in the next figure.

Figure: Local and global optima. (Source)

Local and global optima

\(\bowtie\)

Local minimizers can be characterized in terms of derivatives (or more generally, gradients, their generalization to functions of several variables; we come back to them later in the course). More precisely, they provide a necessary condition.

THEOREM (First-Order Necessary Condition) Let \(f : \mathbb{R} \to \mathbb{R}\) be continuously differentiable on \(\mathbb{R}\). If \(x_0\) is a local minimizer, then \(f'(x_0) = 0\). \(\sharp\)

Proof: We argue by contradiction. Suppose that \(f'(x_0) \neq 0\). Say \(f'(x_0) > 0\) (the other case being similar). By the Descent Direction Lemma, there is a \(\delta > 0\) such that, for each \(x\) in \(B_\delta(x_0)\), \(f(x) < f(x_0)\) if \(x < x_0\). So every open ball around \(x_0\) has a point achieving a smaller value than \(f(x_0)\). Thus \(x_0\) is not a local minimizer, a contradiction. So it must be that \(f'(x_0) = 0\). \(\square\)

Functions of several variables The previous condition generalizes naturally to functions of several variables. The derivative is replaced by the gradient. We quickly review these concepts here. A more thorough review will be provided in a later chapter.

DEFINITION (Partial Derivative) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 = (x_{0,1},\ldots,x_{0,d}) \in D\) be an interior point of \(D\). The partial derivative of \(f\) at \(\mathbf{x}_0\) with respect to \(x_i\) is

\[\begin{align*} \frac{\partial f (\mathbf{x}_0)}{\partial x_i} &= \lim_{h \to 0} \frac{f(\mathbf{x}_0 + h \mathbf{e}_i) - f(\mathbf{x}_0)}{h}\\ &= \lim_{h \to 0} \frac{f(x_{0,1},\ldots,x_{0,i-1},x_{0,i} + h,x_{0,i+1},\ldots,x_{0,d}) - f(x_{0,1},\ldots,x_{0,d})}{h} \end{align*}\]

provided the limit exists. If \(\frac{\partial f (\mathbf{x}_0)}{\partial x_i}\) exists and is continuous in an open ball around \(\mathbf{x}_0\) for all \(i\), then we say that \(f\) is continuously differentiable at \(\mathbf{x}_0\). \(\natural\)

DEFINITION (Gradient) Let \(f : D \to \mathbb{R}\) where \(D \subseteq \mathbb{R}^d\) and let \(\mathbf{x}_0 \in D\) be an interior point of \(D\). Assume \(f\) is continuously differentiable at \(\mathbf{x}_0\). The (column) vector

\[ \nabla f(\mathbf{x}_0) = \left(\frac{\partial f (\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f (\mathbf{x}_0)}{\partial x_d}\right) \]

is called the gradient of \(f\) at \(\mathbf{x}_0\). \(\natural\)

Note that the gradient is itself a function of \(\mathbf{x}\). In fact, unlike \(f\), it is a vector-valued function.

Figure: Gradient as a function (Source)

Gradient as a function

\(\bowtie\)

We generalize the Descent Direction Lemma to the multivariable case. We first need to define what a descent direction is.

DEFINITION (Descent Direction) Let \(f : \mathbb{R}^d \to \mathbb{R}\). A vector \(\mathbf{v}\) is a descent direction for \(f\) at \(\mathbf{x}_0\) if there is \(\alpha^* > 0\) such that

\[ f(\mathbf{x}_0 + \alpha \mathbf{v}) < f(\mathbf{x}_0), \quad \forall \alpha \in (0,\alpha^*). \]

\(\natural\)

LEMMA (Descent Direction: Multivariable Version) Let \(f : \mathbb{R}^d \to \mathbb{R}\) be continuously differentiable at \(\mathbf{x}_0\) and assume that \(\nabla f(\mathbf{x}_0) \neq 0\). Then \(f\) has a descent direction at \(\mathbf{x}_0\). \(\flat\)

THEOREM (First-Order Necessary Condition) Let \(f : \mathbb{R}^d \to \mathbb{R}\) be continuously differentiable on \(\mathbb{R}^d\). If \(\mathbf{x}_0\) is a local minimizer, then \(\nabla f(\mathbf{x}_0) = \mathbf{0}\). \(\sharp\)
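As a quick sanity check of the first-order condition (a sketch; we return to gradient computations later in the course), consider f(x) = ‖x‖² on ℝ², which has a global minimizer at the origin. Approximating the partial derivatives by finite differences at that point gives a vector close to zero:

f = lambda x: np.sum(x ** 2)
x0, h = np.zeros(2), 1e-6
grad_approx = np.array([(f(x0 + h * e) - f(x0)) / h for e in np.identity(2)])
print(grad_approx)   # approximately the zero vector, consistent with the gradient vanishing at x0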

1.2.3. Probability#

Finally, we review a few key definitions and results from probability theory. Further concepts will be reviewed later in the course.

Expectation, variance and Chebyshev’s inequality Recall that the expectation (or mean) of a function \(h\) of a discrete random variable \(X\) taking values in \(\mathcal{X}\) is given by

\[ \mathbb{E}[h(X)] = \sum_{x \in \mathcal{X}} h(x)\,p_X(x) \]

where \(p_X(x) = \mathbb{P}[X = x]\) is the probability mass function (PMF) of \(X\). In the continuous case, we have

\[ \mathbb{E}[h(X)] = \int h(x) f_X(x)\,\mathrm{d}x \]

if \(f_X\) is the probability density function (PDF) of \(X\).

These definitions extend to functions of multiple variables by using instead the joint PMF or PDF.

We sometimes denote the expectation of \(X\) by \(\mu_X\).

Two key properties of the expectation are:

  • linearity, that is,

\[ \mathbb{E}[\alpha_1 h_1(X) + \alpha_2 h_2(Y) + \beta] = \alpha_1 \,\mathbb{E}[h_1(X)] + \alpha_2 \,\mathbb{E}[h_2(Y)] + \beta \]
  • monotonicity, that is, if \(h_1(x) \leq h_2(x)\) for all \(x\) then

\[ \mathbb{E}[h_1(X)] \leq \mathbb{E}[h_2(X)]. \]

The variance of a real-valued random variable \(X\) is

\[ \mathrm{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] \]

and its standard deviation is \(\sigma_X = \sqrt{\mathrm{Var}[X]}\). The variance does not satisfy linearity, but we have the following property

\[ \mathrm{Var}[\alpha X + \beta] = \alpha^2 \,\mathrm{Var}[X]. \]
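As a small worked example (a sketch, using the PMF of a fair six-sided die), we can compute the mean and variance directly from the definitions and check the scaling property above:

x = np.arange(1, 7)                  # possible values of a fair die
p = np.ones(6) / 6                   # PMF
mean = np.sum(x * p)                 # E[X] = 3.5
var = np.sum((x - mean) ** 2 * p)    # Var[X] = 35/12, approximately 2.92
print(mean, var)
print(np.sum(((2 * x + 1) - (2 * mean + 1)) ** 2 * p), 4 * var)  # Var[2X + 1] equals 4 Var[X]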

The standard deviation is a measure of the typical deviation of \(X\) around its mean, that is, of the spread of the distribution.

A quantitative version of this statement is given by Chebyshev’s inequality.

THEOREM (Chebyshev) For a random variable \(X\) with finite variance, we have for any \(\alpha > 0\)

\[ \mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2} = \left(\frac{\sigma_X}{\alpha}\right)^2. \]

\(\sharp\)

The intuition is the following: if the expected squared deviation from the mean is small, then the absolute deviation from the mean is unlikely to be large.

To formalize this, we prove a more general inequality, Markov's inequality. In words, if a non-negative random variable has a small expectation then it is unlikely to be large.

LEMMA (Markov) Let \(Z\) be a non-negative random variable with finite expectation. Then, for any \(\beta > 0\),

\[ \mathbb{P}[Z \geq \beta] \leq \frac{\mathbb{E}[Z]}{\beta}. \]

\(\flat\)

Proof idea: The quantity \(\beta \,\mathbb{P}[Z \geq \beta]\) is a lower bound on the expectation of \(Z\) restricted to the range \(\{Z\geq \beta\}\), which by non-negativity is itself at most \(\mathbb{E}[Z]\).

Proof: Formally, let \(\mathbf{1}_A\) be the indicator of the event \(A\), that is, it is the random variable that is \(1\) when \(A\) occurs and \(0\) otherwise. By definition, the expectation of \(\mathbf{1}_A\) is

\[ \mathbb{E}[\mathbf{1}_A] = 0\,\mathbb{P}[A^c] + 1\,\mathbb{P}[A] = \mathbb{P}[A] \]

where \(A^c\) is the complement of \(A\). Hence, by linearity and monotonicity,

\[ \beta \,\mathbb{P}[Z \geq \beta] = \beta \,\mathbb{E}[\mathbf{1}_{Z \geq \beta}] = \mathbb{E}[\beta \mathbf{1}_{Z \geq \beta}] \leq \mathbb{E}[Z]. \]

Rearranging gives the claim. \(\square\)

Finally we return to the proof of Chebyshev.

Proof idea (Chebyshev): Simply apply Markov to the squared deviation of \(X\) from its mean.

Proof (Chebyshev): Let \(Z = (X - \mathbb{E}[X])^2\), which is non-negative by definition. Hence, by Markov, for any \(\beta = \alpha^2 > 0\)

\[\begin{align*} \mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] &= \mathbb{P}[(X - \mathbb{E}[X])^2 \geq \alpha^2]\\ &= \mathbb{P}[Z \geq \beta]\\ &\leq \frac{\mathbb{E}[Z]}{\beta}\\ &= \frac{\mathrm{Var}[X]}{\alpha^2} \end{align*}\]

where we used the definition of the variance in the last equality. \(\square\)

A few important remarks about Chebyshev’s inequality:

(1) We sometimes need a one-sided bound of the form

\[ \mathbb{P}[X - \mathbb{E}[X] \geq \alpha]. \]

Note the absence of absolute values compared to the two-sided form appearing in Chebyshev’s inequality. In this case, we can use the fact that the event \(\{X - \mathbb{E}[X] \geq \alpha\}\) implies a fortiori that \(\{|X - \mathbb{E}[X]| \geq \alpha\}\), so that the probability of the former is smaller than that of the latter by monotonicity, namely,

\[ \mathbb{P}[X - \mathbb{E}[X] \geq \alpha] \leq \mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha]. \]

We can then use Chebyshev’s inequality on the right-hand side to obtain

\[ \mathbb{P}[X - \mathbb{E}[X] \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2}. \]

Similarly, for the same reasons, we also have

\[ \mathbb{P}[X - \mathbb{E}[X] \leq - \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2}. \]

(2) In terms of the standard deviation \(\sigma_X = \sqrt{\mathrm{Var}[X]}\) of \(X\), the inequality can be re-written as

\[ \mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2} = \left(\frac{\sigma_X}{\alpha}\right)^2. \]

So to get a small bound on the right-hand side, one needs the deviation from the mean \(\alpha\) to be significantly larger than the standard deviation. In words, a random variable is unlikely to be away from its mean by much more than its standard deviation. This observation is consistent with the interpretation of the standard deviation as the typical spread of a random variable.
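To see the inequality in action, here is a rough simulation sketch (using a local random number generator with an arbitrary seed) for a random variable uniformly distributed on \([0,1]\), whose mean is \(1/2\) and whose variance is \(1/12\); the empirical frequency of a deviation of at least \(\alpha = 1.5\,\sigma_X\) should fall below the Chebyshev bound \((\sigma_X/\alpha)^2 \approx 0.44\):

rng_check = np.random.default_rng(123)       # local generator; the seed is arbitrary
n = 100000
X = rng_check.random(n)                      # uniform samples on [0, 1)
sigma = np.sqrt(1 / 12)                      # standard deviation of the uniform distribution on [0, 1]
alpha = 1.5 * sigma
print(np.mean(np.abs(X - 0.5) >= alpha))     # empirical frequency, roughly 0.13
print((sigma / alpha) ** 2)                  # Chebyshev bound, about 0.44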

Chebyshev’s inequality is particularly useful when combined with independence, as we recall next.

Independence and limit theorems Recall that discrete random variables \(X\) and \(Y\) are independent if their joint PMF factorizes, that is

\[ p_{X,Y}(x,y) = p_X(x) \,p_Y(y), \qquad \forall x, y \]

where \(p_{X,Y}(x,y) = \mathbb{P}[X=x, Y=y]\). Similarly, continuous random variables \(X\) and \(Y\) are independent if their joint PDF factorizes. One consequence is that expectations of products of single-variable functions factorize as well, that is, for functions \(g\) and \(h\) we have

\[ \mathbb{E}[g(X) h(Y)] = \mathbb{E}[g(X)] \,\mathbb{E}[h(Y)], \]

provided the expectations exist.

An important way to quantify the lack of independence of two random variables is the covariance.

DEFINITION (Covariance) The covariance of random variables \(X\) and \(Y\) with finite means and variances is defined as

\[ \mathrm{Cov}[X,Y] = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]. \]

\(\natural\)

Note that, by definition, the covariance is symmetric: \(\mathrm{Cov}[X,Y] = \mathrm{Cov}[Y,X]\).

When \(X\) and \(Y\) are independent, their covariance is \(0\):

\[\begin{align*} \mathrm{Cov}[X,Y] &= \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]\\ &= \mathbb{E}\left[X - \mathbb{E}[X]\right]\mathbb{E}\left[Y - \mathbb{E}[Y]\right]\\ &= \left(\mathbb{E}[X] - \mathbb{E}[X]\right)\left(\mathbb{E}[Y] - \mathbb{E}[Y]\right)\\ &= 0, \end{align*}\]

where we used independence on the second line and the linearity of expectations on the third one.

A related quantity of interest in data science is the correlation coefficient which is obtained by dividing the covariance by the product of the standard deviations

\[ \rho_{X, Y} = \frac{\mathrm{Cov}[X,Y]}{\sigma_X \sigma_Y}. \]

By Cauchy-Schwarz, it lies in \([-1,1]\). (Prove it!)
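In Numpy, sample versions of these quantities can be computed with numpy.cov and numpy.corrcoef; here is a quick sketch on artificially correlated samples (local generator with an arbitrary seed; the construction of Y is just for illustration):

rng_cov = np.random.default_rng(0)            # local generator; the seed is arbitrary
X = rng_cov.random(1000)
Y = X + 0.5 * rng_cov.random(1000)            # Y is positively correlated with X
print(np.cov(X, Y))                           # 2 x 2 sample covariance matrix
print(np.corrcoef(X, Y)[0, 1])                # sample correlation coefficient, roughly 0.9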

The covariance leads to a useful identity for the variance of a sum of random variables.

LEMMA (Variance of a Sum) Let \(X_1,\ldots,X_n\) be random variables with finite means and variances. Then we have

\[ \mathrm{Var}[X_1 + \cdots + X_n] = \sum_{i=1}^n \mathrm{Var}[X_i] + 2 \sum_{i < j} \mathrm{Cov}[X_i, X_j]. \]

\(\flat\)

Proof: By definition of the variance and linearity of expectations,

\[\begin{align*} &\mathrm{Var}[X_1 + \cdots + X_n]\\ &= \mathbb{E}\left[(X_1 + \cdots + X_n - \mathbb{E}[X_1 + \cdots + X_n])^2\right]\\ &= \mathbb{E}\left[(X_1 + \cdots + X_n - \mathbb{E}[X_1] - \cdots - \mathbb{E}[X_n])^2\right]\\ &= \mathbb{E}\left[\left((X_1 - \mathbb{E}[X_1]) + \cdots + (X_n - \mathbb{E}[X_n])\right)^2\right]\\ &= \sum_{i=1}^n \mathbb{E}\left[(X_i - \mathbb{E}[X_i])^2\right] + \sum_{i \neq j} \mathbb{E}\left[(X_i - \mathbb{E}[X_i]) (X_j - \mathbb{E}[X_j])\right]. \end{align*}\]

The claim follows from the definition of the variance and covariance, and the symmetry of the covariance. \(\square\)

The previous lemma has the following important implication. If \(X_1, \ldots, X_n\) are pairwise independent, real-valued random variables, then

\[ \mathrm{Var}[X_1 + \cdots + X_n] = \mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n]. \]

Notice that, unlike the case of the expectation, this linearity property for the variance requires independence.

Applied to the sample mean of \(n\) independent, identically distributed (i.i.d.) random variables \(X_1,\ldots,X_n\), we obtain

\[\begin{align*} \mathrm{Var} \left[\frac{1}{n} \sum_{i=1}^n X_i\right] &= \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i]\\ &= \frac{1}{n^2} n \,\mathrm{Var}[X_1]\\ &= \frac{\mathrm{Var}[X_1]}{n}. \end{align*}\]

So the variance of the sample mean decreases as \(n\) gets large, while its expectation remains the same by linearity

\[\begin{align*} \mathbb{E} \left[\frac{1}{n} \sum_{i=1}^n X_i\right] &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i]\\ &= \frac{1}{n} n \,\mathbb{E}[X_1]\\ &= \mathbb{E}[X_1]. \end{align*}\]

Together with Chebyshev’s inequality, we immediately get that the sample mean approaches its expectation in the following probabilistic sense.

THEOREM (Weak Law of Large Numbers) Let \(X_1, \ldots, X_n\) be i.i.d. with finite variance. For any \(\varepsilon > 0\), as \(n \to +\infty\),

\[ \mathbb{P}\left[\left|\frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1]\right| \geq \varepsilon\right] \to 0. \]

\(\sharp\)

Proof: By Chebyshev and the formulas above,

\[\begin{align*} \mathbb{P}\left[\left|\frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1]\right| \geq \varepsilon\right] &= \mathbb{P}\left[\left|\frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E} \left[\frac{1}{n} \sum_{i=1}^n X_i\right]\right| \geq \varepsilon\right]\\ &\leq \frac{\mathrm{Var}\left[\frac{1}{n} \sum_{i=1}^n X_i\right]}{\varepsilon^2}\\ &= \frac{\mathrm{Var}[X_1]}{n \varepsilon^2}\\ &\to 0 \end{align*}\]

as \(n \to +\infty\). \(\square\)

NUMERICAL CORNER: We can use simulations to confirm the Weak Law of Large Numbers. Recall that a uniform random variable over the interval \([a,b]\) has density

\[\begin{split} f_{X}(x) = \begin{cases} \frac{1}{b-a} & x \in [a,b] \\ 0 & \text{o.w.} \end{cases} \end{split}\]

We write \(X \sim \mathrm{U}[a,b]\). We can obtain a sample from \(\mathrm{U}[0,1]\) by using the function numpy.random.Generator.uniform. We must first instantiate a random number generator (RNG) with numpy.random.default_rng in Numpy. We provide a seed as an initial state for the RNG. Using the same seed again ensures reproducibility.

seed = 535
rng = np.random.default_rng(seed)
rng.uniform()
0.9836159914889122

Now we take \(n\) samples from \(\mathrm{U}[0,1]\) and compute their sample mean. We repeat \(k\) times and display the empirical distribution of the sample means using a histogram.

def lln_unif(n, k):
    sample_mean = [np.mean(rng.random(n)) for i in range(k)]
    plt.hist(sample_mean,bins=15)
    plt.xlim(0,1)
    plt.title(f'n={n}')
    plt.show()

We start with \(n=10\).

lln_unif(10, 1000)

Taking \(n\) much larger leads to more concentration around the mean.

lln_unif(100, 1000)

TRY IT! Recall that the cumulative distribution function (CDF) of a random variable \(X\) is defined as

\[ F_X(z) = \mathbb{P}[X \leq z], \qquad \forall z \in \mathbb{R}. \]

a) Let \(\mathcal{Z}\) be the interval where \(F_X(z) \in (0,1)\) and assume that \(F_X\) is strictly increasing on \(\mathcal{Z}\). Let \(U \sim \mathrm{U}[0,1]\). Show that

\[ \mathbb{P}[F_X^{-1}(U) \leq z] = F_X(z). \]

b) Generate a sample from \(\mathrm{U}[a,b]\) for arbitrary \(a\), \(b\) using numpy.random.Generator.uniform and the observation in a). This is called the inverse transform sampling method. (Open in Colab)

a = -1
b = 1
X = rng.uniform()
# EDIT THIS LINE: transform X to obtain a random variable Y ~ U[a,b]

\(\unlhd\)

Random vectors and matrices A random vector \(\bX = (X_1,\ldots,X_d)\) in \(\mathbb{R}^d\) is a \(d\)-dimensional vector whose coordinates \(X_1,\ldots,X_d\) are random variables defined on the same probability space; in particular, they may be correlated.

The mean \(\bmu_\bX\) of a random vector \(\bX\) is itself a vector, whose coordinates are the means of the coordinates,

\[\begin{split} \bmu_\bX = \E[\bX] = \begin{pmatrix} \E[X_1]\\ \vdots\\ \E[X_d] \end{pmatrix}. \end{split}\]

The linearity of expectation generalizes to (check it!)

\[ \E[A \bX + \mathbf{b}] = A\,\E[\bX] + \mathbf{b} \]

for a deterministic (i.e., non-random) matrix \(A \in \mathbb{R}^{\ell \times d}\) and vector \(\mathbf{b} \in \mathbb{R}^{\ell}\).

A random matrix \(\mathbf{M} = (M_{i,j})_{i,j} \in \mathbb{R}^{\ell \times d}\) is a matrix whose entries are random variables defined on the same probability space; again, they may be correlated. The expectation of a random matrix is the (deterministic) matrix whose entries are the expectations of the entries of \(\mathbf{M}\)

\[\begin{split} \mathbb{E}[\mathbf{M}] = \begin{pmatrix} \E[M_{1,1}] & \cdots & \E[M_{1,d}]\\ \vdots & \ddots & \vdots\\ \E[M_{\ell,1}] & \cdots & \E[M_{\ell,d}] \end{pmatrix}. \end{split}\]

The linearity of expectation generalizes to (check it!)

\[ \E[A \mathbf{M} + B] = A\,\E[\mathbf{M}] + B \]

for deterministic matrices \(A \in \mathbb{R}^{k \times \ell}\) and \(B \in \mathbb{R}^{k \times d}\).

The covariance matrix \(\bSigma_{\bX}\) of a random vector \(\bX\) is the matrix whose \((i,j)\)-entry is the covariance of coordinates \(i\) and \(j\)

\[ (\bSigma_{\bX})_{i,j} = \mathrm{Cov}[X_i,X_j] = \mathbb{E}\left[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])\right]. \]

Using a previous example, this can be written in more compact matrix form as follows

\[ \bSigma_\bX = \E\left[ (\bX - \bmu_\bX) (\bX - \bmu_\bX)^T \right], \]

where we think of \(\bX\) as a column vector. Observe that in this calculation, \((\bX - \bmu_\bX) (\bX - \bmu_\bX)^T\) is a random matrix.

Covariance matrices have two special properties: they are symmetric and positive semidefinite.

DEFINITION (Symmetric Matrix) A square matrix \(B \in \mathbb{R}^{n \times n}\) is symmetric if \(B^T = B\). \(\natural\)

DEFINITION (Positive Semidefinite Matrix) A symmetric matrix \(B \in \mathbb{R}^{n \times n}\) is positive semidefinite if

\[ \langle \mathbf{z}, B \mathbf{z} \rangle \geq 0, \quad \forall \mathbf{z} \neq \mathbf{0}. \]

We also write \(B \succeq 0\) in that case. If the inequality above is strict, we say that \(B\) is positive definite, in which case we write \(B \succ 0\). \(\natural\)

It is sometimes useful to note that, in matrix form, \(\langle \mathbf{z}, B \mathbf{z} \rangle = \mathbf{z}^T B \mathbf{z}\).

Symmetry comes from the definition of the covariance:

\[\begin{align*} \mathrm{Cov}[X_i,X_j] &= \mathbb{E}\left[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])\right]\\ &= \mathbb{E}\left[(X_j - \mathbb{E}[X_j])(X_i - \mathbb{E}[X_i])\right]\\ &=\mathrm{Cov}[X_j,X_i]. \end{align*}\]

THEOREM (Positive Semidefiniteness of the Covariance) The covariance matrix \(\bSigma_\bX\) of a random vector \(\bX\) is positive semidefinite. \(\sharp\)

Proof idea: The expression \(\langle \mathbf{z}, \bSigma_\bX \mathbf{z} \rangle\) can be re-written as the variance of a sum of random variables. Variances are always non-negative.

Proof: By definition of the covariance,

\[\begin{align*} \langle \mathbf{z}, \bSigma_\bX \mathbf{z} \rangle &= \sum_{i,j} z_i z_j \mathrm{Cov}[X_i, X_j]\\ &= \sum_{i,j} z_i z_j \mathbb{E}\left[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])\right]\\ &= \sum_{i,j} \mathbb{E}\left[(z_i X_i - \mathbb{E}[ z_iX_i])(z_j X_j - \mathbb{E}[z_j X_j])\right]\\ &= \sum_{i,j} \mathrm{Cov}[z_i X_i, z_j X_j]. \end{align*}\]

Using the fact that \(\mathrm{Cov}[X, X] = \mathrm{Var}[X]\) and \(\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]\), this last sum can be rearranged as

\[ \sum_{i=1}^d \mathrm{Var}[z_i X_i] + 2 \sum_{i < j} \mathrm{Cov}[z_i X_i, z_j X_j]. \]

We have encountered this expression previously! By the Variance of a Sum, this is

\[ \mathrm{Var}\left[\sum_{i=1}^d z_i X_i\right], \]

which is non-negative. That concludes the proof. \(\square\)

THINK-PAIR-SHARE: Give a shorter proof of the Positive Semidefiniteness of the Covariance using the matrix form of \(\bSigma_\bX\). \(\ddagger\)

Later on, we will need the covariance matrix of a linear transformation. We first note that the covariance has convenient linearity properties:

\[\begin{align*} \mathrm{Cov}[\alpha X + \beta, Y] &= \mathbb{E}\left[(\alpha X + \beta - \mathbb{E}[\alpha X + \beta])(Y - \mathbb{E}[Y])\right]\\ &= \mathbb{E}\left[(\alpha X - \mathbb{E}[\alpha X])(Y - \mathbb{E}[Y])\right]\\ &= \alpha \,\mathbb{E}\left[( X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]\\ &= \alpha \,\mathrm{Cov}[X, Y]. \end{align*}\]

Moreover,

\[\begin{align*} \mathrm{Cov}[X + Z, Y] &= \mathbb{E}\left[(X + Z - \mathbb{E}[X + Z])(Y - \mathbb{E}[Y])\right]\\ &= \mathbb{E}\left[(X - \mathbb{E}[X] + Z - \mathbb{E}[Z])(Y - \mathbb{E}[Y])\right]\\ &= \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] + \mathbb{E}\left[(Z - \mathbb{E}[Z])(Y - \mathbb{E}[Y])\right]\\ &= \mathrm{Cov}[X, Y] + \mathrm{Cov}[Z, Y]. \end{align*}\]

LEMMA: (Covariance of a Linear Transformation) Let \(\bX = (X_1,\ldots,X_d)\) be a random vector in \(\mathbb{R}^d\) with finite variances (i.e., \(\mathrm{Var}[X_i] < +\infty\) for all \(i\)). For a deterministic (i.e., non-random) \(A = (a_{i,j})_{i,j} \in \mathbb{R}^{\ell \times d}\), we have

\[ \mathrm{Cov}[A \mathbf{X}] = A \,\mathrm{Cov}[\mathbf{X}] \,A^T \]

\(\flat\)

Proof: (First proof) Note that \(\mathbf{Y} = A \mathbf{X}\) is a random vector in \(\mathbb{R}^{\ell}\). We compute the pairwise covariances of its entries as follows

\[\begin{align*} \mathrm{Cov}[Y_i, Y_j] &= \mathrm{Cov}\left[\sum_{k=1}^d a_{i, k} X_{k},\sum_{k'=1}^d a_{j, k'} X_{k'} \right]\\ &= \sum_{k=1}^d a_{i, k} \sum_{k'=1}^d a_{j, k'} \,\mathrm{Cov}\left[X_{k},X_{k'} \right]\\ &= A_{i,\cdot} \,\mathrm{Cov}[\mathbf{X}] \,(A^T)_{\cdot,j} \end{align*}\]

where we repeatedly used the properties of the covariance derived above. \(\square\)

Proof: (Second proof) Alternatively, in matrix form, letting \(\bmu_{\mathbf{X}}\) and \(\bmu_{\mathbf{Y}} = A \bmu_{\mathbf{X}}\) be respectively the means of \(\mathbf{X}\) and \(\mathbf{Y}\), we have

\[\begin{align*} \mathrm{Cov}[\mathbf{Y}] &= \mathbb{E}\left[(\mathbf{Y} - \bmu_{\mathbf{Y}})(\mathbf{Y} - \bmu_{\mathbf{Y}})^T\right]\\ &= \mathbb{E}\left[(A \mathbf{X} - A \bmu_{\mathbf{X}})(A \mathbf{X} - A \bmu_{\mathbf{X}})^T\right]\\ &= \mathbb{E}\left[A(\mathbf{X} - \bmu_{\mathbf{X}})(\mathbf{X} - \bmu_{\mathbf{X}})^T A^T\right]\\ &= A \,\mathbb{E}\left[(\mathbf{X} - \bmu_{\mathbf{X}})(\mathbf{X} - \bmu_{\mathbf{X}})^T\right] A^T\\ &= A \,\mathrm{Cov}[\mathbf{X}] \,A^T \end{align*}\]

where we used linearity of expectation twice. \(\square\)
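As a quick empirical sanity check of the lemma (a sketch, assuming the generator rng instantiated earlier in this section is still available; each column of Z below is an independent sample of a random 2-vector with independent uniform coordinates):

A = np.array([[1., 2.], [0., 1.], [3., -1.]])   # a deterministic 3 x 2 matrix
Z = rng.random((2, 10000))                      # samples as columns; rows are the coordinates
Y = A @ Z                                       # samples of the transformed vector A Z
print(np.cov(Y))                                # empirical covariance of A Z
print(A @ np.cov(Z) @ A.T)                      # A Cov[Z] A^T; the two should be close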

For two random vectors \(\mathbf{X} \in \mathbb{R}^{n}\) and \(\mathbf{Y} \in \mathbb{R}^m\) defined on the same probability space, we define the cross covariance matrix as

\[ \mathrm{Cov}[\mathbf{X}, \mathbf{Y}] = \E[(\mathbf{X} - \E[\mathbf{X}])(\mathbf{Y} - \E[\mathbf{Y}])^T]. \]

This is a matrix of dimension \(n \times m\).

The cross covariance matrix of a random vector with itself is the covariance matrix.

Normal distribution Recall that a standard Normal variable \(X\) has PDF

\[ f_X(x) = \frac{1}{\sqrt{2 \pi}} \exp\left( - x^2/2 \right). \]

Its mean is \(0\) and its variance is \(1\).

NUMERICAL CORNER: We plot the PDF of a standard normal distribution. We use the function scipy.stats.norm from the SciPy library, which outputs the PDF. The following code was adapted from here with the help of ChatGPT.

from scipy.stats import norm
# Plot the normal distribution curve
x = np.linspace(-4, 4, 100)
y = norm.pdf(x)
plt.plot(x, y, color='black')

# Fill areas under the curve for different standard deviations
plt.fill_between(x, y, where=(x > -1) & (x < 1), color='red', alpha=0.25)
plt.fill_between(x, y, where=(x > -2) & (x < 2), color='red', alpha=0.25)
plt.hlines(norm.pdf(1), -1, 1, color='black', linestyle='dashed')
plt.hlines(norm.pdf(2), -2, 2, color='black', linestyle='dashed')
plt.text(0, norm.pdf(1) + 0.01, "68.3%", ha='center')
plt.text(0, norm.pdf(2) + 0.01, "95.4%", ha='center')

# Set labels, title, and xticks
plt.ylabel("PDF")
plt.xticks(range(-4, 5, 2), [f'{i}' for i in range(-4, 5, 2)])
plt.show()

\(\unlhd\)

To construct a \(d\)-dimensional version, we take \(d\) independent standard Normal variables \(X_1, X_2, \ldots, X_d\) and form the vector \(\mathbf{X} = (X_1,\ldots,X_d)\). We will say that \(\mathbf{X}\) is a standard Normal \(d\)-vector. By independence, its joint PDF is given by the product of the PDFs of the \(X_i\)’s, that is,

\[\begin{align*} f_{\mathbf{X}}(\mathbf{x}) &= \prod_{i=1}^d \frac{1}{\sqrt{2 \pi}} \exp\left( - x_i^2/2 \right)\\ &= \frac{1}{\prod_{i=1}^d \sqrt{2 \pi}} \exp\left( - \sum_{i=1}^d x_i^2/2 \right)\\ &= \frac{1}{(2 \pi)^{d/2}} \exp(-\|\mathbf{x}\|^2/2). \end{align*}\]

We can also shift and scale it.

DEFINITION (Spherical Gaussian) Let \(\mathbf{Z}\) be a standard Normal \(d\)-vector, let \(\bmu \in \mathbb{R}^d\) and let \(\sigma \in \mathbb{R}_+\). Then we will refer to the transformed random variable \(\mathbf{X} = \bmu + \sigma \mathbf{Z}\) as a spherical Gaussian with mean \(\bmu\) and variance \(\sigma^2\). We use the notation \(\mathbf{X} \sim N_d(\bmu, \sigma^2 I)\). \(\natural\)

NUMERICAL CORNER: The following function generates \(n\) data points from a spherical \(d\)-dimensional Gaussian with variance \(\sigma^2\) and mean \(\bmu\).

Below, rng.normal(0,1,(n,d)) generates \(n\) independent \(d\)-dimensional standard Normal vectors with mean \(\mathbf{0}\) (as row vectors).

def spherical_gaussian(d, n, mu, sig):
    X = mu + sig * rng.normal(0,1,(n,d))
    return X

We generate \(100\) data points in dimension \(d=2\). We take \(\sigma^2 = 1\) and \(\bmu = w \mathbf{e}_1\). Below we use the function numpy.concatenate to create a vector by concatenating two given vectors. We use [w] to create a vector with a single entry w. We also use the function numpy.zeros to create an all-zero vector.

d, n, w = 2, 100, 3.
sig = 1
mu = np.concatenate(([w], np.zeros(d-1)))
X = spherical_gaussian(d, n, mu, sig)
plt.scatter(X[:,0], X[:,1], marker='x')
plt.axis('equal')
plt.show()

\(\unlhd\)

More generally, we consider mixtures of spherical Gaussians, a special case of the Gaussian Mixture Model (GMM). To keep things simple, we will restrict ourselves to mixtures of \(2\) Gaussians, but this can easily be generalized.

This model has a number of parameters. For \(i=0,1\), we have a mean \(\bmu_i \in \mathbb{R}^d\) and a positive variance \(\sigma_i^2 > 0\). We also have mixture weights \(\phi_0, \phi_1 \in (0,1)\) such that \(\phi_0 + \phi_1 = 1\). Suppose we want to generate a total of \(n\) samples.

For each sample \(j=1,\ldots, n\), independently from everything else:

  1. We first pick a component \(i \in \{0,1\}\) at random according to the mixture weights, that is, \(i=0\) is chosen with probability \(\phi_0\) and \(i=1\) is chosen with probability \(\phi_1\).

  2. We generate a sample \(\bX_j = (X_{j,1},\ldots,X_{j,d})\) according to a spherical Gaussian with mean \(\bmu_i\) and variance \(\sigma_i^2\).

NUMERICAL CORNER: This is straightforward to implement by using numpy.random.Generator.choice to choose the component of each sample.

The code is the following. It returns an n by d array X, where each row is a sample from a 2-component spherical Gaussian mixture.

def gmm2spherical(d, n, phi0, phi1, mu0, sig0, mu1, sig1):
    
    # merge components into matrices
    phi = np.stack((phi0, phi1))
    mu = np.stack((mu0, mu1))
    sig = np.stack((sig0,sig1))
    
    # initialization
    X = np.zeros((n,d))
    
    # choose components of each data point, then generate samples
    component = rng.choice(2, size=n, p=phi)
    for i in range(n):
        X[i,:] = spherical_gaussian(
            d, 1, mu[component[i],:], sig[component[i]])
    
    return X

Let us try it with the following parameters.

d, n, w = 2, 1000, 3.
phi0 = 0.1
phi1 = 0.9
mu0 = np.concatenate(([w], np.zeros(d-1)))
mu1 = np.concatenate(([-w], np.zeros(d-1)))
sig0 = 1
sig1 = 0.5
X = gmm2spherical(d, n, phi0, phi1, mu0, sig0, mu1, sig1)
plt.scatter(X[:,0], X[:,1], marker='x')
plt.axis('equal')
plt.show()

As expected, we observe two clusters. The one on the right (component \(0\)) is sparser (i.e., it contains fewer data points) since phi0 is much smaller than phi1. It is also more spread out, since its variance is larger.

\(\unlhd\)