
\(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\S}{\mathcal{S}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\indep}{\perp\!\!\!\perp}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\)

7.6. More advanced material

In this section, we cover some more advanced material.

7.6.1. Cholesky decomposition

In this section, we derive an important notion of matrix factorization.

Main theorem Our key linear-algebraic result of this section is the following. The matrix factorization in the next theorem is called a Cholesky decomposition. It has many applications.

THEOREM (Cholesky Decomposition) Any positive definite matrix \(B \in \mathbb{R}^{n \times n}\) can be factorized uniquely as

\[ B = L L^T \]

where \(L \in \mathbb{R}^{n \times n}\) is a lower triangular matrix with positive entries on the diagonal. \(\sharp\)

The proof is provided in a subsection below. An important consequence of the proof is an algorithm for computing the Cholesky decomposition. Indeed, it follows from the equations in the proof that we can grow \(L\) from its top-left corner, successively computing each new row from the previously constructed submatrix. Note that, because \(L\) is lower triangular, it suffices to compute its entries on and below the diagonal.

Figure: Access pattern (Source)

\(\bowtie\)

EXAMPLE: Before proceeding with the general method, we give a small example to provide some intuition as to how it operates. Consider the matrix

\[\begin{split} A = \begin{pmatrix} 1 & 2 & 1\\ 0 & -2 & 1\\ 0 & 0 & 1\\ 0 & 0 & 0 \end{pmatrix}. \end{split}\]

We will need the following lemma first.

LEMMA (Least Squares and Positive Semidefiniteness) Let \(A \in \mathbb{R}^{n\times m}\) be an \(n\times m\) matrix with \(n \geq m\). The matrix \(B = A^T A\) is positive semidefinite. If further the columns of \(A\) are linearly independent, then the matrix \(B\) is positive definite. \(\flat\)

Proof: For any \(\mathbf{x}\),

\[ \mathbf{x}^T (A^T A) \mathbf{x} = (A \mathbf{x})^T (A \mathbf{x}) = \|A \mathbf{x}\|^2 \geq 0. \]

Hence \(B \succeq 0\). If the inequality above holds with equality, then \(A \mathbf{x} = \mathbf{0}\) by the point-separating property of the Euclidean norm. If \(A\) has full column rank, the Equivalent Definition of Linear Independence then implies that \(\mathbf{x} = \mathbf{0}\), which establishes the second claim. \(\square\)

By the Least Squares and Positive Semidefiniteness Lemma, the matrix

\[\begin{split} A^T A = \begin{pmatrix} 1 & 2 & 1\\ 2 & 8 & 0\\ 1 & 0 & 3 \end{pmatrix} \end{split}\]

is positive semidefinite. In fact, the columns of \(A\) are linearly independent since \(A \mathbf{x} = \mathbf{0}\) can be written as

\[\begin{align*} x_1 + 2x_2 + x_3 &= 0\\ -2 x_2 + x_3 &= 0\\ x_3 &= 0 \end{align*}\]

whose solution is \(x_3 = x_2 = x_1 = 0\) by backward substitution, i.e., \(\mathbf{x} = \mathbf{0}\). Hence \(A^T A\) is positive definite by the Least Squares and Positive Semidefiniteness Lemma.
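
As a quick numerical sanity check (a small sketch, not part of the argument), we can also confirm positive definiteness by computing the eigenvalues of the symmetric matrix \(A^T A\), which are strictly positive if and only if the matrix is positive definite.

import numpy as np

A = np.array([[1., 2., 1.],
              [0., -2., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])
# eigenvalues of the symmetric matrix A^T A; all strictly positive iff positive definite
print(np.linalg.eigvalsh(A.T @ A))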

Let \(L = (\ell_{i,j})_{i,j=1}^3\) be lower triangular. We seek to solve \(L L^T = A^T A\) for the nonzero entries of \(L\). Observe that

\[\begin{split} \begin{pmatrix} \ell_{1,1} & 0 & 0\\ \ell_{2,1} & \ell_{2,2} & 0\\ \ell_{3,1} & \ell_{3,2} & \ell_{3,3} \end{pmatrix} \begin{pmatrix} \ell_{1,1} & \ell_{2,1} & \ell_{3,1}\\ 0 & \ell_{2,2} & \ell_{3,2}\\ 0 & 0 & \ell_{3,3} \end{pmatrix} = \begin{pmatrix} \ell_{1,1}^2 & \ell_{1,1}\ell_{2,1} & \ell_{1,1}\ell_{3,1}\\ \ell_{1,1}\ell_{2,1} & \ell_{2,1}^2 + \ell_{2,2}^2 & \ell_{2,1}\ell_{3,1} + \ell_{2,2}\ell_{3,2}\\ \ell_{1,1}\ell_{3,1} & \ell_{2,1}\ell_{3,1} + \ell_{2,2}\ell_{3,2} & \ell_{3,1}^2 + \ell_{3,2}^2 + \ell_{3,3}^2 \end{pmatrix}. \end{split}\]

The system

\[\begin{split} \begin{pmatrix} \ell_{1,1}^2 & \ell_{1,1}\ell_{2,1} & \ell_{1,1}\ell_{3,1}\\ \ell_{1,1}\ell_{2,1} & \ell_{2,1}^2 + \ell_{2,2}^2 & \ell_{2,1}\ell_{3,1} + \ell_{2,2}\ell_{3,2}\\ \ell_{1,1}\ell_{3,1} & \ell_{2,1}\ell_{3,1} + \ell_{2,2}\ell_{3,2} & \ell_{3,1}^2 + \ell_{3,2}^2 + \ell_{3,3}^2 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 1\\ 2 & 8 & 0\\ 1 & 0 & 3 \end{pmatrix} \end{split}\]

turns out to be fairly simple to solve.

  1. From the first entry, we get \(\ell_{1,1} = 1\) (where we took the positive solution to \(\ell_{1,1}^2 = 1\)).

  2. Given that \(\ell_{1,1}\) is known, entry \(\ell_{2,1}\) is determined from \(\ell_{1,1}\ell_{2,1} =2\) in the first entry of the second row. That is, \(\ell_{2,1} =2\). Then the second entry of the second row gives \(\ell_{2,2}\) through \(\ell_{2,1}^2 + \ell_{2,2}^2 = 8\). So \(\ell_{2,2} = 2\) (again we take the positive solution).

  3. We move to the third row. The first entry gives \(\ell_{3,1} = 1\), the second entry gives \(\ell_{3,2} = -1\) and finally the third entry leads to \(\ell_{3,3} = 1\).
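
As a quick check, multiplying the computed factor by its transpose should recover \(A^T A\). Here is a small numerical sketch of that verification.

import numpy as np

L = np.array([[1., 0., 0.],
              [2., 2., 0.],
              [1., -1., 1.]])
print(L @ L.T)  # recovers [[1, 2, 1], [2, 8, 0], [1, 0, 3]]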

\(\lhd\)

To detail the computation of the Cholesky decomposition \(L L^T\) of \(B\), we will need some notation. Write \(B = (b_{i,j})_{i,j=1}^n\) and \(L = (\ell_{i,j})_{i,j=1}^n\). Let \(L_{(k)} = (\ell_{i,j})_{i,j=1}^k\) be the first \(k\) rows and columns of \(L\), let \(\bflambda_{(k)}^T = (\ell_{k,1},\ldots,\ell_{k,k-1})\) be the row vector corresponding to the first \(k-1\) entries of row \(k\) of \(L\), and let \(\bfbeta_{(k)}^T = (b_{k,1},\ldots,b_{k,k-1})\) be the row vector corresponding to the first \(k-1\) entries of row \(k\) of \(B\).

The strategy is to compute \(L_{(1)}\), then \(L_{(2)}\), then \(L_{(3)}\) and so on. With the notation above, \(L_{(j)}\) can be written in block form as

\[\begin{split} L_{(j)} = \begin{pmatrix} L_{(j-1)} & \mathbf{0}\\ \bflambda_{(j)}^T & \ell_{j,j} \end{pmatrix}. \end{split}\]

Hence, once \(L_{(j-1)}\) is known, in order to compute \(L_{(j)}\) one only needs \(\bflambda_{(j)}\) and \(\ell_{j,j}\). We show next that they satisfy easily solvable systems of equations.

We first note that the \((1,1)\) entry of the matrix equation \(L L^T = B\) implies that

\[ \ell_{1,1}^2 = b_{1,1}. \]

So we set

\[ L_{(1)} = \ell_{1,1} = \sqrt{b_{1,1}}. \]

For this step to be well-defined, we need \(b_{1,1} > 0\). This follows from the positive definiteness of \(B\):

\[ 0 < \langle \mathbf{e}_1, B \mathbf{e}_1\rangle = \mathbf{e}_1^T B_{\cdot,1} = b_{1,1}. \]

Proceeding by induction, assume \(L_{(j-1)}\) has been constructed. The first \(j-1\) elements of the \(j\)-th row of the matrix equation \(L L^T = B\) translate into

\[ L_{j,\cdot} (L^T)_{\cdot,1:j-1} = \bflambda_{(j)}^T L_{(j-1)}^T = \bfbeta_{(j)}^T, \]

where \((L^T)_{\cdot,1:j-1}\) denotes the first \(j-1\) columns of \(L^T\). In the first equality above, we used the fact that \(L^T\) is upper triangular. Taking a transpose, the resulting linear system of equations

\[ L_{(j-1)} \bflambda_{(j)} = \bfbeta_{(j)} \]

can be solved by forward substitution (since \(\bfbeta_{(j)}\) is part of the input and \(L_{(j-1)}\) was previously computed). The fact that this system has a unique solution (more specifically, that the diagonal entries of \(L_{(j-1)}\) are strictly positive) is established in the proof of the Cholesky Decomposition Theorem.
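
For concreteness, here is a minimal sketch of forward substitution for a lower triangular system with nonzero diagonal entries; the routine `mmids.forwardsubs` used in the implementation below plays this role (its exact interface is assumed here).

import numpy as np

def forward_substitution(L, b):
    # solve L z = b for lower triangular L with nonzero diagonal entries
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        # entries z[0], ..., z[i-1] are already known
        z[i] = (b[i] - L[i,0:i] @ z[0:i]) / L[i,i]
    return z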

The \((j,j)\)-th entry of the matrix equation \(L L^T = B\) translates into

\[ L_{j,\cdot} (L^T)_{\cdot,j} = \sum_{k=1}^j \ell_{j,k}^2 = b_{j,j}, \]

where again we used the fact that \(L^T\) is upper triangular. Since \(\ell_{j,1}, \ldots, \ell_{j,j-1}\) are the elements of \(\bflambda_{(j)}\), they have already been determined. So we can set

\[ \ell_{j,j} = \sqrt{b_{j,j} - \sum_{k=1}^{j-1} \ell_{j,k}^2}. \]

The fact that we are taking the square root of a positive quantity is established in the proof of the Cholesky Decomposition Theorem. Finally, from \(L_{(j-1)}\), \(\bflambda_{(j)}\), and \(\ell_{j,j}\), we construct \(L_{(j)}\).

NUMERICAL CORNER: We implement the algorithm above. In our naive implementation, we assume that \(B\) is positive definite, and therefore that all steps are well-defined.

import numpy as np
from numpy import linalg as LA
import mmids  # course module providing forwardsubs and backsubs

def cholesky(B):
    n = B.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # solve L_{(j-1)} lambda_{(j)} = beta_{(j)} by forward substitution
        L[j,0:j] = mmids.forwardsubs(L[0:j,0:j], B[j,0:j])
        # diagonal entry: ell_{j,j} = sqrt(b_{j,j} - ||lambda_{(j)}||^2)
        L[j,j] = np.sqrt(B[j,j] - LA.norm(L[j,0:j])**2)
    return L

Here is a simple example.

B = np.array([[2., 1.],[1., 2.]])
print(B)
[[2. 1.]
 [1. 2.]]
L = cholesky(B)
print(L)
[[1.41421356 0.        ]
 [0.70710678 1.22474487]]

We can check that it produces the right factorization.

print(L @ L.T)
[[2. 1.]
 [1. 2.]]
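
For comparison, NumPy's built-in routine `numpy.linalg.cholesky` also returns the lower triangular factor and should agree with our naive implementation up to floating-point error.

print(np.linalg.cholesky(B))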

\(\unlhd\)

Proof of Cholesky decomposition theorem We give a proof of the Cholesky Decomposition Theorem.

Proof idea: Assuming by induction that the upper-left corner of the matrix \(B\) has a Cholesky decomposition, one finds equations for the remaining row that can be solved uniquely by the properties established in the previous subsection.

Proof: If \(n=1\), we have shown previously that \(b_{1,1} > 0\), and hence we can take \(L = [\ell_{1,1}]\) where \(\ell_{1,1} = \sqrt{b_{1,1}}\). Assuming the result holds for positive definite matrices in \(\mathbb{R}^{(n-1) \times (n-1)}\), we first re-write \(B = L L^T\) in block form

\[\begin{split} \begin{pmatrix} B_{11} & \bfbeta_{12}\\ \bfbeta_{12}^T & \beta_{22} \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \mathbf{0}\\ \bflambda_{12}^T & \lambda_{22} \end{pmatrix} \begin{pmatrix} \Lambda_{11}^T & \bflambda_{12}\\ \mathbf{0}^T & \lambda_{22} \end{pmatrix} \end{split}\]

where \(B_{11}, \Lambda_{11} \in \mathbb{R}^{(n-1) \times (n-1)}\), \(\bfbeta_{12}, \bflambda_{12} \in \mathbb{R}^{n-1}\) and \(\beta_{22}, \lambda_{22} \in \mathbb{R}\). By block matrix algebra, we get the system

\[\begin{split} B_{11} = \Lambda_{11} \Lambda_{11}^T\\ \bfbeta_{12} = \Lambda_{11} \bflambda_{12}\\ \beta_{22} = \bflambda_{12}^T \bflambda_{12} + \lambda_{22}^2. \end{split}\]

By the Principal Submatrices Lemma, the principal submatrix \(B_{11}\) is positive definite. Hence, by induction, there is a unique lower-triangular matrix \(\Lambda_{11}\) with positive diagonal elements satisfying the first equation. We can then obtain \(\bflambda_{12}\) from the second equation by forward substitution. And finally we get

\[ \lambda_{22} = \sqrt{\beta_{22} - \bflambda_{12}^T \bflambda_{12}}. \]

We do have to check that the square root above exists. That is, we need to argue that the expression inside the square root is non-negative. In fact, for the claim to go through, we need it to be strictly positive. We notice that the expression inside the square root is in fact the Schur complement of the block \(B_{11}\):

\[\begin{align*} \beta_{22} - \bflambda_{12}^T \bflambda_{12} &= \beta_{22} - (\Lambda_{11}^{-1} \bfbeta_{12})^T (\Lambda_{11}^{-1} \bfbeta_{12})\\ &= \beta_{22} - \bfbeta_{12}^T (\Lambda_{11}^{-1})^T \Lambda_{11}^{-1} \bfbeta_{12}\\ &= \beta_{22} - \bfbeta_{12}^T (\Lambda_{11} \Lambda_{11}^T)^{-1} \bfbeta_{12}\\ &= \beta_{22} - \bfbeta_{12}^T (B_{11})^{-1} \bfbeta_{12} \end{align*}\]

where we used the equation \(\bfbeta_{12} = \Lambda_{11} \bflambda_{12}\) on the first line, the identities \((Q W)^{-1} = W^{-1} Q^{-1}\) and \((Q^T)^{-1} = (Q^{-1})^T\) (see the exercise below) on the third line and the equation \(B_{11} = \Lambda_{11} \Lambda_{11}^T\) on the fourth line. By the Schur Complement Lemma, the Schur complement is positive definite. Because it is a scalar in this case, it is strictly positive (prove it!), which concludes the proof. \(\square\)

Using a Cholesky decomposition to solve the least squares problem In this section, we restrict ourselves to the case where \(A \in \mathbb{R}^{n\times m}\) has full column rank. By the Least Squares and Positive Semidefiniteness Lemma, we then have that \(A^T A\) is positive definite. By the Cholesky Decomposition Theorem, we can factorize this matrix as \(A^T A = L L^T\) where \(L\) is lower triangular with positive diagonal elements. The normal equations then reduce to

\[ L L^T \mathbf{x} = A^T \mathbf{b}. \]

This system can be solved in two steps. We first obtain the solution to

\[ L \mathbf{z} = A^T \mathbf{b} \]

by forward substitution. Then we obtain the solution to

\[ L^T \mathbf{x} = \mathbf{z} \]

by back-substitution. Note that \(L^T\) is indeed an upper triangular matrix.
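
By analogy with forward substitution, here is a minimal sketch of back-substitution for an upper triangular system with nonzero diagonal entries; the routine `mmids.backsubs` used in the implementation below plays this role (again, its exact interface is assumed).

import numpy as np

def back_substitution(U, b):
    # solve U x = b for upper triangular U with nonzero diagonal entries
    n = len(b)
    x = np.zeros(n)
    for i in reversed(range(n)):
        # entries x[i+1], ..., x[n-1] are already known
        x[i] = (b[i] - U[i,i+1:n] @ x[i+1:n]) / U[i,i]
    return x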

NUMERICAL CORNER: We implement this algorithm below. In our naive implementation, we assume that \(A\) has full column rank, and therefore that all steps are well-defined.

def ls_by_chol(A, b):
    # normal equations: (A^T A) x = A^T b with A^T A = L L^T
    L = cholesky(A.T @ A)
    # forward substitution for L z = A^T b, then back-substitution for L^T x = z
    z = mmids.forwardsubs(L, A.T @ b)
    return mmids.backsubs(L.T, z)
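
As a small usage sketch, we can compare the result to NumPy's least-squares solver on a random overdetermined system (a Gaussian random matrix has full column rank with probability one).

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)
print(ls_by_chol(A, b))
print(np.linalg.lstsq(A, b, rcond=None)[0])  # should agree up to floating-point error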

\(\unlhd\)

Other applications of the Cholesky decomposition are briefly described here.