\(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\S}{\mathcal{S}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\btheta}{\boldsymbol{\theta}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\indep}{\perp\!\!\!\perp}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\bu}{\mathbf{u}}\) \(\newcommand{\bU}{\mathbf{U}}\)

7.4. Linear-Gaussian models and Kalman filtering#

In this section, we illustrate the use of linear-Gaussian models for object tracking. We first give some background.

7.4.1. Multivariate Gaussians: marginals and conditionals#

We will need the marginal and conditional densities of a multivariate Gaussian. We first need various formulas and results about block matrices.

Block matrices It will be convenient to introduce block matrices. First recall that, for a vector \(\mathbf{x} \in \mathbb{R}^n\), we write \(\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)\), where \(\mathbf{x}_1 \in \mathbb{R}^{n_1}\) and \(\mathbf{x}_2 \in \mathbb{R}^{n_2}\) with \(n_1 + n_2 = n\), to indicate that \(\mathbf{x}\) is partitioned into two blocks: \(\mathbf{x}_1\) corresponds to the first \(n_1\) coordinates of \(\mathbf{x}\) while \(\mathbf{x}_2\) corresponds to the following \(n_2\) coordinates.

More generally, a block matrix is a matrix whose rows and columns are partitioned into blocks, as in

\[\begin{split} A = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix} \end{split}\]

where \(A \in \mathbb{R}^{n \times m}\), \(A_{ij} \in \mathbb{R}^{n_i \times m_j}\) for \(i,j = 1, 2\) with the conditions \(n_1 + n_2 = n\) and \(m_1 + m_2 = m\). One can also consider larger numbers of blocks, but it will not be necessary here.

Block matrices have a convenient algebra that mimics the usual matrix algebra. Specifically, if \(B_{ij} \in \mathbb{R}^{m_i \times p_j}\) for \(i,j = 1, 2\), then it holds that

\[\begin{split} \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12}\\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22}\\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}. \end{split}\]

Observe that the block sizes of \(A\) and \(B\) must match for this formula to make sense. You can convince yourself of this identity by trying it on a simple example. For a proof sketch, see e.g. here. Warning: while the formula is similar to the usual matrix product, the order of multiplication matters because the blocks are matrices and they do not in general commute!
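
As a quick numerical sanity check (not part of the original text; the block sizes below are arbitrary), the block-product formula can be verified with NumPy, using numpy.block to assemble matrices from their blocks:

import numpy as np

rng = np.random.default_rng(0)
n1, n2, m1, m2, p1, p2 = 2, 3, 4, 2, 3, 2

# random blocks with matching inner dimensions
A11, A12 = rng.normal(size=(n1, m1)), rng.normal(size=(n1, m2))
A21, A22 = rng.normal(size=(n2, m1)), rng.normal(size=(n2, m2))
B11, B12 = rng.normal(size=(m1, p1)), rng.normal(size=(m1, p2))
B21, B22 = rng.normal(size=(m2, p1)), rng.normal(size=(m2, p2))

A = np.block([[A11, A12], [A21, A22]])
B = np.block([[B11, B12], [B21, B22]])

# blockwise product according to the formula above
C = np.block([[A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
              [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22]])

print(np.allclose(A @ B, C))  # True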

Consider a square block matrix with the same partitioning of the rows and columns, that is,

\[\begin{split} A = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix} \end{split}\]

where \(A \in \mathbb{R}^{n \times n}\), \(A_{ij} \in \mathbb{R}^{n_i \times n_i}\) for \(i = 1, 2\) with the condition \(n_1 + n_2 = n\). Then it is straightforward to check (try it!) that the transpose can be written as

\[\begin{split} A^T = \begin{pmatrix} A_{11}^T & A_{21}^T\\ A_{12}^T & A_{22}^T \end{pmatrix}. \end{split}\]

In particular, if \(A\) is symmetric then \(A_{11} = A_{11}^T\), \(A_{22} = A_{22}^T\) and \(A_{21} = A_{12}^T\).

EXAMPLE: The formula above applies to partitioned vectors as well, by taking for instance \(B_{12}\) and \(B_{22}\) to be empty blocks (i.e., blocks with zero columns). Consider the partitioned vector \(\mathbf{z} = (\mathbf{z}_1, \mathbf{z}_2)\), where \(\mathbf{z}_1 \in \mathbb{R}^{n_1}\) and \(\mathbf{z}_2 \in \mathbb{R}^{n_2}\). We want to compute the quadratic form

\[\begin{split} \mathbf{z}^T \begin{pmatrix} A_{11} & A_{12}\\ A_{12}^T & A_{22} \end{pmatrix} \mathbf{z} \end{split}\]

where \(A_{ij} \in \mathbb{R}^{n_i \times n_j}\) for \(i,j = 1, 2\) with the conditions \(n_1 + n_2 = n\), and \(A_{11}^T = A_{11}\) and \(A_{22}^T = A_{22}\).

We apply the formula above twice to get

\[\begin{align*} &(\mathbf{z}_1, \mathbf{z}_2)^T \begin{pmatrix} A_{11} & A_{12}\\ A_{12}^T & A_{22} \end{pmatrix} (\mathbf{z}_1, \mathbf{z}_2)\\ &= (\mathbf{z}_1, \mathbf{z}_2)^T \begin{pmatrix} A_{11} \mathbf{z}_1 + A_{12} \mathbf{z}_2\\ A_{12}^T \mathbf{z}_1 + A_{22} \mathbf{z}_2 \end{pmatrix}\\ &= \mathbf{z}_1^T A_{11} \mathbf{z}_1 + \mathbf{z}_1^T A_{12} \mathbf{z}_2 + \mathbf{z}_2^T A_{12}^T \mathbf{z}_1 + \mathbf{z}_2^T A_{22} \mathbf{z}_2\\ &= \mathbf{z}_1^T A_{11} \mathbf{z}_1 + 2 \mathbf{z}_1^T A_{12} \mathbf{z}_2 + \mathbf{z}_2^T A_{22} \mathbf{z}_2. \end{align*}\]

\(\lhd\)
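
Here is a short numerical check of the quadratic-form identity derived in this example (a sketch; the sizes and the random symmetric blocks below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 2

A11 = rng.normal(size=(n1, n1)); A11 = (A11 + A11.T) / 2  # symmetric
A22 = rng.normal(size=(n2, n2)); A22 = (A22 + A22.T) / 2  # symmetric
A12 = rng.normal(size=(n1, n2))
A = np.block([[A11, A12], [A12.T, A22]])

z1, z2 = rng.normal(size=n1), rng.normal(size=n2)
z = np.concatenate([z1, z2])

lhs = z @ A @ z
rhs = z1 @ A11 @ z1 + 2 * z1 @ A12 @ z2 + z2 @ A22 @ z2
print(np.allclose(lhs, rhs))  # True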

EXAMPLE: Let \(A_{ii} \in \mathbb{R}^{n_i \times n_i}\) for \(i = 1, 2\) be invertible. We claim that

\[\begin{split} \begin{pmatrix} A_{11} & \mathbf{0}\\ \mathbf{0} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} & \mathbf{0}\\ \mathbf{0} & A_{22}^{-1} \end{pmatrix}. \end{split}\]

Both matrices above are block diagonal. The matrix on the right-hand side is well-defined by the invertibility of \(A_{11}\) and \(A_{22}\). We check the claim using the formula for products of block matrices.

Indeed, we obtain

\[\begin{align*} &\begin{pmatrix} A_{11} & \mathbf{0}\\ \mathbf{0} & A_{22} \end{pmatrix} \begin{pmatrix} A_{11}^{-1} & \mathbf{0}\\ \mathbf{0} & A_{22}^{-1} \end{pmatrix}\\ &= \begin{pmatrix} A_{11} A_{11}^{-1} + \mathbf{0} & \mathbf{0} + \mathbf{0}\\ \mathbf{0} + \mathbf{0} & \mathbf{0} + A_{22} A_{22}^{-1} \end{pmatrix} = \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix} = I_{n \times n} \end{align*}\]

and similarly for the other direction. \(\lhd\)

EXAMPLE: Let \(A_{21} \in \mathbb{R}^{n_2 \times n_1}\). Then we claim that

\[\begin{split} \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ A_{21} & I_{n_2 \times n_2} \end{pmatrix}^{-1} = \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - A_{21} & I_{n_2 \times n_2} \end{pmatrix}. \end{split}\]

A similar formula holds for the block upper triangular case. In particular, such matrices are invertible (which can be proved in other ways, for instance through determinants).

It suffices to check:

\[\begin{align*} &\begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ A_{21} & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ -A_{21} & I_{n_2 \times n_2} \end{pmatrix}\\ &= \begin{pmatrix} I_{n_1 \times n_1} I_{n_1 \times n_1} + \mathbf{0} & \mathbf{0} + \mathbf{0}\\ A_{21} I_{n_1 \times n_1} + (-A_{21}) I_{n_1 \times n_1} & \mathbf{0} + I_{n_2 \times n_2} I_{n_2 \times n_2} \end{pmatrix} = \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix} = I_{n \times n} \end{align*}\]

Taking a transpose gives a similar formula for the block upper triangular case. \(\lhd\)

Schur complement We will need the concept of a Schur complement. We start with a couple of observations about positive definite matrices.

Recall:

LEMMA (Invertibility of Positive Definite Matrices) Let \(B \in \mathbb{R}^{n \times n}\) be positive definite. Then \(B\) is invertible. \(\flat\)

Proof: For any \(\mathbf{x} \neq \mathbf{0}\), it holds by positive definiteness that \(\mathbf{x}^T B \mathbf{x} > 0\). In particular, it must be that \(\mathbf{x}^T B \mathbf{x} \neq 0\) and therefore, by contradiction, \(B \mathbf{x} \neq \mathbf{0}\) (since for any \(\mathbf{z}\), it holds that \(\mathbf{z}^T \mathbf{0} = 0\)). The claim follows from the Equivalent Definition of Linear Independence. \(\square\)

A principal submatrix is a square submatrix obtained by removing some rows and columns. Moreover we require that the set of row indices that remain is the same as the set of column indices that remain.

LEMMA (Principal Submatrices) Let \(B \in \mathbb{R}^{n \times n}\) be positive definite and let \(Z \in \mathbb{R}^{n \times p}\) have full column rank. Then \(Z^T B Z\) is positive definite. In particular all principal submatrices of positive definite matrices are positive definite. \(\flat\)

Proof: If \(\mathbf{x} \neq \mathbf{0}\), then \(\mathbf{x}^T (Z^T B Z) \mathbf{x} = \mathbf{y}^T B \mathbf{y}\), where we defined \(\mathbf{y} = Z \mathbf{x}\). Because \(Z\) has full column rank and \(\mathbf{x} \neq \mathbf{0}\), it follows that \(\mathbf{y} \neq \mathbf{0}\) by the Equivalent Definition of Linear Independence. Hence, since \(B \succ 0\), we have \(\mathbf{y}^T B \mathbf{y} > 0\) which proves the first claim. For the second claim, take \(Z\) of the form \((\mathbf{e}_{m_1}\ \mathbf{e}_{m_2}\ \ldots\ \mathbf{e}_{m_p})\), where the indices \(m_1, \ldots, m_p\) are distinct and increasing. The columns of \(Z\) are then linearly independent since they are distinct basis vectors. \(\square\)

To better understand the last claim in the proof, note that

\[ (Z^T B Z)_{i,j} = (Z^T)_{i,\cdot} B Z_{\cdot,j} = (Z_{\cdot,i})^T B Z_{\cdot,j} = \sum_{k=1}^n \sum_{\ell=1}^n Z_{k,i} B_{k,\ell} Z_{\ell,j}. \]

So if the \(i\)-th column of \(Z\) is \(\mathbf{e}_{m_i}\) and the \(j\)-th column of \(Z\) is \(\mathbf{e}_{m_j}\), then the rightmost summation picks up only one element, \(B_{m_i, m_j}\). In other words, \(Z^T B Z\) is the principal submatrix of \(B\) corresponding to rows and columns \(m_1, \ldots, m_p\).
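
To illustrate, the following sketch (with an arbitrary symmetric positive definite matrix and 0-based indices, neither of which is from the text) builds \(Z\) from standard basis vectors and checks that \(Z^T B Z\) is the corresponding principal submatrix, and that it is positive definite:

import numpy as np

rng = np.random.default_rng(2)
n = 4
G = rng.normal(size=(n, n))
B = G @ G.T + n * np.eye(n)   # symmetric positive definite

idx = [0, 2]                  # the indices m_1, m_2 (0-based here)
Z = np.eye(n)[:, idx]         # columns are e_{m_1}, e_{m_2}

print(np.allclose(Z.T @ B @ Z, B[np.ix_(idx, idx)]))      # True
print(np.all(np.linalg.eigvalsh(Z.T @ B @ Z) > 0))        # True: positive definite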

MULTIPLE CHOICE: Consider the matrices

\[\begin{split} A = \begin{pmatrix} 1 & 3 & 0 & 2\\ 2 & 8 & 4 & 0\\ 6 & 1 & 1 & 4\\ 3 & 2 & 0 & 1 \end{pmatrix} \qquad\text{and}\qquad Z = \begin{pmatrix} 0 & 0\\ 1 & 0\\ 0 & 0\\ 0 & 1 \end{pmatrix} \end{split}\]

Which of the following matrices is \(Z^T A Z\)?

a) \(\begin{pmatrix} 1 & 0\\ 6 & 1 \end{pmatrix}\)

b) \(\begin{pmatrix} 8 & 0\\ 2 & 1 \end{pmatrix}\)

c) \(\begin{pmatrix} 2 & 8 & 4 & 0\\ 3 & 2 & 0 & 1 \end{pmatrix}\)

d) \(\begin{pmatrix} 1 & 0\\ 2 & 4\\ 6 & 1\\ 3 & 0 \end{pmatrix}\)

\(\ddagger\)

LEMMA (Schur Complement) Let \(B \in \mathbb{R}^{n \times n}\) be positive definite and write it in block form

\[\begin{split} B = \begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix} \end{split}\]

where \(B_{11} \in \mathbb{R}^{n_1 \times n_1}\) and \(B_{22} \in \mathbb{R}^{(n - n_1) \times (n - n_1)}\) are symmetric, and \(B_{12} \in \mathbb{R}^{n_1 \times (n - n_1)}\). Then the Schur complement of the block \(B_{11}\), i.e., the matrix \(B/B_{11} := B_{22} - B_{12}^T B_{11}^{-1} B_{12}\), is symmetric and positive definite. The same holds for \(B/B_{22} := B_{11} - B_{12} B_{22}^{-1} B_{12}^T\). \(\flat\)

Proof: By the Principal Submatrices Lemma, \(B_{11}\) is positive definite. By the Invertibility of Positive Definite Matrices Lemma, \(B_{11}\) is therefore invertible. Hence the Schur complement is well defined. Moreover, it is symmetric since

\[ (B/B_{11})^T = B_{22}^T - (B_{12}^T B_{11}^{-1} B_{12})^T = B_{22} - B_{12}^T B_{11}^{-1} B_{12} = B/B_{11}, \]

by the symmetry of \(B_{11}\), \(B_{22}\), and \(B_{11}^{-1}\) (prove that last one!).

For a non-zero \(\mathbf{x} \in \mathbb{R}^{n_2}\), let

\[\begin{split} \mathbf{z} = \begin{pmatrix} \mathbf{z}_1\\ \mathbf{z}_2 \end{pmatrix} = \begin{pmatrix} B_{11}^{-1} B_{12} \mathbf{x}\\ - \mathbf{x} \end{pmatrix}. \end{split}\]

The result then follows from the observation that

\[\begin{align*} &\mathbf{z}^T \begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix} \mathbf{z}\\ &= \mathbf{z}_1^T B_{11} \mathbf{z}_1 + 2 \mathbf{z}_1^T B_{12} \mathbf{z}_2 + \mathbf{z}_2^T B_{22} \mathbf{z}_2\\ &= (B_{11}^{-1} B_{12} \mathbf{x})^T B_{11} B_{11}^{-1} B_{12} \mathbf{x} + 2 (B_{11}^{-1} B_{12} \mathbf{x})^T B_{12} (- \mathbf{x}) + (- \mathbf{x})^T B_{22} (- \mathbf{x})\\ &= \mathbf{x}^T B_{12}^T B_{11}^{-1} B_{12}\,\mathbf{x} - 2 \mathbf{x}^T B_{12}^T B_{11}^{-1} B_{12}\,\mathbf{x} + \mathbf{x}^T B_{22}\,\mathbf{x}\\ &= \mathbf{x}^T(B_{22} - B_{12}^T B_{11}^{-1} B_{12})\,\mathbf{x}. \end{align*}\]

Indeed, since \(\mathbf{z} \neq \mathbf{0}\) (its second block is \(-\mathbf{x} \neq \mathbf{0}\)) and \(B\) is positive definite, the left-hand side is strictly positive, so \(\mathbf{x}^T (B/B_{11})\,\mathbf{x} > 0\) for all \(\mathbf{x} \neq \mathbf{0}\). The claim for \(B/B_{22}\) follows by exchanging the roles of the blocks. \(\square\)
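
A quick numerical illustration of the lemma (a sketch with an arbitrarily generated positive definite matrix): both Schur complements have strictly positive eigenvalues.

import numpy as np

rng = np.random.default_rng(3)
n, n1 = 5, 2
G = rng.normal(size=(n, n))
B = G @ G.T + n * np.eye(n)          # symmetric positive definite

B11, B12, B22 = B[:n1, :n1], B[:n1, n1:], B[n1:, n1:]

schur11 = B22 - B12.T @ np.linalg.solve(B11, B12)   # B / B_{11}
schur22 = B11 - B12 @ np.linalg.solve(B22, B12.T)   # B / B_{22}

print(np.linalg.eigvalsh(schur11))   # all strictly positive
print(np.linalg.eigvalsh(schur22))   # all strictly positive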

Inverting a block matrix We will need a classical formula for inverting a block matrix. We restrict ourselves to the positive definite case, where the Principal Submatrices and Schur Complement lemmas guarantee the required invertibility conditions.

LEMMA (Inverting a Block Matrix) Let \(B \in \mathbb{R}^{n \times n}\) be symmetric and positive definite and write it in block form as

\[\begin{split} B = \begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix} \end{split}\]

where \(B_{11} \in \mathbb{R}^{n_1 \times n_1}\) and \(B_{22} \in \mathbb{R}^{(n - n_1) \times (n - n_1)}\) are symmetric, and \(B_{12} \in \mathbb{R}^{n_1 \times (n - n_1)}\). Then

\[\begin{split} B^{-1} = \begin{pmatrix} (B/B_{22})^{-1} & - (B/B_{22})^{-1} B_{12} B_{22}^{-1}\\ - B_{22}^{-1} B_{12}^T (B/B_{22})^{-1} & B_{22}^{-1} B_{12}^T (B/B_{22})^{-1} B_{12} B_{22}^{-1} + B_{22}^{-1} \end{pmatrix}. \end{split}\]

Alternatively,

\[\begin{split} B^{-1} = \begin{pmatrix} B_{11}^{-1} B_{12} (B/B_{11})^{-1} B_{12}^T B_{11}^{-1} + B_{11}^{-1} & - B_{11}^{-1} B_{12} (B/B_{11})^{-1}\\ - (B/B_{11})^{-1} B_{12}^T B_{11}^{-1} & (B/B_{11})^{-1} \end{pmatrix}. \end{split}\]

\(\flat\)

In particular, the matrix \(B^{-1}\) is symmetric (prove it!).

Proof idea: One way to prove this is to multiply \(B\) and \(B^{-1}\) and check that the identity matrix comes out (try it!). We give a longer proof that provides more insight into where the formula is coming from.

The trick is to multiply \(B\) on the left and right by carefully chosen block triangular matrices with identity matrices on the diagonal (which we know are invertible by a previous example) to produce a block diagonal matrix with invertible matrices on the diagonal (which we know how to invert by a previous example).

Proof: We only prove the first formula. By the Principal Submatrices and Schur Complement lemmas, the formula for the inverse is well-defined. The rest of the proof is a calculation based on the formula for matrix products of block matrices. We will need that, if \(C\), \(D\), and \(E\) are invertible and of the same size, then \((CDE)^{-1} = E^{-1} D^{-1} C^{-1}\) (check it!).

We make a series of observations.

1- Our first step is to get a zero block in the upper right corner using an invertible matrix. Note that (recall that the order of multiplication matters here!)

\[\begin{align*} &\begin{pmatrix} I_{n_1 \times n_1} & - B_{12} B_{22}^{-1}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix}\\ &= \begin{pmatrix} B_{11} - B_{12} B_{22}^{-1} B_{12}^T & B_{12} - B_{12} B_{22}^{-1} B_{22}\\ \mathbf{0} + B_{12}^T & \mathbf{0} + B_{22} \end{pmatrix} = \begin{pmatrix} B/B_{22} & \mathbf{0}\\ B_{12}^T & B_{22} \end{pmatrix}. \end{align*}\]

2- Next we get a zero block in the bottom left corner. Then, starting from the final matrix in the last display,

\[\begin{align*} & \begin{pmatrix} B/B_{22} & \mathbf{0}\\ B_{12}^T & B_{22} \end{pmatrix} \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix}\\ &= \begin{pmatrix} B/B_{22} + \mathbf{0} & \mathbf{0} + \mathbf{0} \\ B_{12}^T - B_{22} B_{22}^{-1} B_{12}^T & \mathbf{0} + B_{22} \end{pmatrix} = \begin{pmatrix} B/B_{22} & \mathbf{0}\\ \mathbf{0} & B_{22} \end{pmatrix}. \end{align*}\]

3- Combining the last two steps, we have shown that

\[\begin{split} \begin{pmatrix} I_{n_1 \times n_1} & - B_{12} B_{22}^{-1}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix} \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix} = \begin{pmatrix} B/B_{22} & \mathbf{0}\\ \mathbf{0} & B_{22} \end{pmatrix}. \end{split}\]

Using the formula for the inverse of a product of three invertible matrices, we obtain

\[\begin{align*} &\begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix}^{-1} \begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix}^{-1} \begin{pmatrix} I_{n_1 \times n_1} & - B_{12} B_{22}^{-1}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix}^{-1}\\ &= \begin{pmatrix} B/B_{22} & \mathbf{0}\\ \mathbf{0} & B_{22} \end{pmatrix}^{-1}. \end{align*}\]

4- Rearranging and using the formula for the inverse of a block diagonal matrix, we finally get

\[\begin{align*} &\begin{pmatrix} B_{11} & B_{12}\\ B_{12}^T & B_{22} \end{pmatrix}^{-1}\\ &= \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} B/B_{22} & \mathbf{0}\\ \mathbf{0} & B_{22} \end{pmatrix}^{-1} \begin{pmatrix} I_{n_1 \times n_1} & - B_{12} B_{22}^{-1}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix}\\ &= \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} (B/B_{22})^{-1} & \mathbf{0}\\ \mathbf{0} & B_{22}^{-1} \end{pmatrix} \begin{pmatrix} I_{n_1 \times n_1} & - B_{12} B_{22}^{-1}\\ \mathbf{0} & I_{n_2 \times n_2} \end{pmatrix}\\ &= \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} (B/B_{22})^{-1} + \mathbf{0}& - (B/B_{22})^{-1} B_{12} B_{22}^{-1} + \mathbf{0}\\ \mathbf{0} + \mathbf{0} & \mathbf{0} + B_{22}^{-1} \end{pmatrix}\\ &= \begin{pmatrix} I_{n_1 \times n_1} & \mathbf{0}\\ - B_{22}^{-1} B_{12}^T & I_{n_2 \times n_2} \end{pmatrix} \begin{pmatrix} (B/B_{22})^{-1} & - (B/B_{22})^{-1} B_{12} B_{22}^{-1} \\ \mathbf{0} & B_{22}^{-1} \end{pmatrix}\\ &= \begin{pmatrix} (B/B_{22})^{-1} & - (B/B_{22})^{-1} B_{12} B_{22}^{-1}\\ - B_{22}^{-1} B_{12}^T (B/B_{22})^{-1} & B_{22}^{-1} B_{12}^T (B/B_{22})^{-1} B_{12} B_{22}^{-1} + B_{22}^{-1} \end{pmatrix}, \end{align*}\]

as claimed. \(\square\)
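
As a sanity check of the first formula (a sketch; the positive definite matrix below is arbitrary), we can compare it to numpy.linalg.inv:

import numpy as np

rng = np.random.default_rng(4)
n, n1 = 5, 2
G = rng.normal(size=(n, n))
B = G @ G.T + n * np.eye(n)          # symmetric positive definite
B11, B12, B22 = B[:n1, :n1], B[:n1, n1:], B[n1:, n1:]

S = np.linalg.inv(B11 - B12 @ np.linalg.solve(B22, B12.T))   # (B / B_{22})^{-1}
B22inv = np.linalg.inv(B22)

Binv_formula = np.block([
    [S,                   -S @ B12 @ B22inv],
    [-B22inv @ B12.T @ S,  B22inv @ B12.T @ S @ B12 @ B22inv + B22inv]
])

print(np.allclose(Binv_formula, np.linalg.inv(B)))   # True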

Marginals and conditionals We are now ready to derive the distribution of marginals and conditionals of multivariate Gaussians. Recall that a multivariate Gaussian vector \(\mathbf{X} = (X_1,\ldots,X_d)\) on \(\mathbb{R}^d\) with mean \(\bmu \in \mathbb{R}^d\) and positive definite covariance matrix \(\bSigma \in \mathbb{R}^{d \times d}\) has probability density function

\[ f_{\bmu, \bSigma}(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \,|\bSigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu)\right). \]

Recall that one way to compute \(|\bSigma|\) is as the product of all eigenvalues of \(\bSigma\) (with repeats). The matrix \(\bLambda = \bSigma^{-1}\) is called the precision matrix.
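
The following sketch (with arbitrary parameters) evaluates this density directly, checks the determinant-eigenvalue fact, and forms the precision matrix:

import numpy as np

rng = np.random.default_rng(5)
d = 3
mu = rng.normal(size=d)
G = rng.normal(size=(d, d))
Sig = G @ G.T + d * np.eye(d)        # positive definite covariance

# determinant as the product of the eigenvalues (with repeats)
print(np.allclose(np.linalg.det(Sig), np.prod(np.linalg.eigvalsh(Sig))))  # True

Lam = np.linalg.inv(Sig)             # precision matrix
print(np.allclose(Lam @ Sig, np.eye(d)))                                  # True

def gaussian_pdf(x, mu, Sig):
    dev = x - mu
    quad = dev @ np.linalg.solve(Sig, dev)
    return np.exp(-quad / 2) / ((2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sig)))

print(gaussian_pdf(mu, mu, Sig))     # density at the mean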

Partition \(\mathbf{X}\) as the column vector \((\mathbf{X}_1, \mathbf{X}_2)\) where \(\mathbf{X}_i \in \mathbb{R}^{d_i}\), \(i=1,2\), with \(d_1 + d_2 = d\). Similarly, consider the corresponding block vectors and matrices

\[\begin{split} \bmu = \begin{pmatrix} \bmu_1\\ \bmu_2 \end{pmatrix} \qquad \bSigma = \begin{pmatrix} \bSigma_{11} & \bSigma_{12}\\ \bSigma_{21} & \bSigma_{22} \end{pmatrix} \qquad \bLambda = \begin{pmatrix} \bLambda_{11} & \bLambda_{12}\\ \bLambda_{21} & \bLambda_{22} \end{pmatrix}. \end{split}\]

Note that by the symmetry of \(\bSigma\), we have \(\bSigma_{21} = \bSigma_{12}^T\). Furthermore, it can be proved that a symmetric, invertible matrix has a symmetric inverse (try it!) so that \(\bLambda_{21} = \bLambda_{12}^T\).

We seek to compute the marginals \(f_{\mathbf{X}_1}(\mathbf{x}_1)\) and \(f_{\mathbf{X}_2}(\mathbf{x}_2)\), as well as the conditional density of \(\mathbf{X}_1\) given \(\mathbf{X}_2\), which we denote as \(f_{\mathbf{X}_1|\mathbf{X}_2}(\mathbf{x}_1|\mathbf{x}_2)\), and similarly \(f_{\mathbf{X}_2|\mathbf{X}_1}(\mathbf{x}_2|\mathbf{x}_1)\).

By the multiplication rule, the joint density \(f_{\mathbf{X}_1,\mathbf{X}_2}(\mathbf{x}_1,\mathbf{x}_2)\) can be decomposed as

\[ f_{\mathbf{X}_1,\mathbf{X}_2}(\mathbf{x}_1,\mathbf{x}_2) = f_{\mathbf{X}_1|\mathbf{X}_2}(\mathbf{x}_1|\mathbf{x}_2) f_{\mathbf{X}_2}(\mathbf{x}_2). \]

We use the Inverting a Block Matrix lemma to rewrite \(f_{\mathbf{X}_1,\mathbf{X}_2}(\mathbf{x}_1,\mathbf{x}_2)\) in this form and “reveal” the marginal and conditional involved. Indeed, once the joint density is in this form, by integrating over \(\mathbf{x}_1\) we obtain that the marginal density of \(\mathbf{X}_2\) is \(f_{\mathbf{X}_2}\) and the conditional density of \(\mathbf{X}_1\) given \(\mathbf{X}_2\) is obtained by taking the ratio of the joint and the marginal.

In fact, it will be easier to work with an expression derived in the proof of that lemma, specifically

\[\begin{align*} &\begin{pmatrix} \bSigma_{11} & \bSigma_{12}\\ \bSigma_{21} & \bSigma_{22} \end{pmatrix}^{-1}\\ &= \begin{pmatrix} I_{d_1 \times d_1} & \mathbf{0}\\ - \bSigma_{22}^{-1} \bSigma_{12}^T & I_{d_2 \times d_2} \end{pmatrix} \begin{pmatrix} (\bSigma/\bSigma_{22})^{-1} & \mathbf{0}\\ \mathbf{0} & \bSigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} I_{d_1 \times d_1} & - \bSigma_{12} \bSigma_{22}^{-1}\\ \mathbf{0} & I_{d_2 \times d_2} \end{pmatrix}. \end{align*}\]

We will use the fact that the first matrix on the last line is the transpose of the third one (check it!).

To evaluate the joint density, we need to expand the quadratic function \((\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu)\) appearing in the exponential. We break this up in a few steps.

1- Note first that

\[\begin{split} \begin{pmatrix} I_{d_1 \times d_1} & - \bSigma_{12} \bSigma_{22}^{-1}\\ \mathbf{0} & I_{d_2 \times d_2} \end{pmatrix} \begin{pmatrix} \mathbf{x}_1 - \bmu_1\\ \mathbf{x}_2 - \bmu_2 \end{pmatrix} = \begin{pmatrix} (\mathbf{x}_1 - \bmu_1) - \bSigma_{12} \bSigma_{22}^{-1} (\mathbf{x}_2 - \bmu_2) \\ \mathbf{x}_2 - \bmu_2 \end{pmatrix} \end{split}\]

and similarly for its transpose. We define

\[ \bmu_{1|2}(\mathbf{x}_2) = \bmu_1 + \bSigma_{12} \bSigma_{22}^{-1} (\mathbf{x}_2 - \bmu_2). \]

2- Plugging this back in the quadratic function, we get

\[\begin{align*} &(\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu)\\ &= \begin{pmatrix} \mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2) \\ \mathbf{x}_2 - \bmu_2 \end{pmatrix}^T \begin{pmatrix} (\bSigma/\bSigma_{22})^{-1} & \mathbf{0}\\ \mathbf{0} & \bSigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} \mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2) \\ \mathbf{x}_2 - \bmu_2 \end{pmatrix}\\ &= (\mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2))^T (\bSigma/\bSigma_{22})^{-1} (\mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2)) + (\mathbf{x}_2 - \bmu_2)^T \bSigma_{22}^{-1} (\mathbf{x}_2 - \bmu_2). \end{align*}\]

Note that both terms have the same form as the original quadratic function.

3- Going back to the density, we use \(\propto\) to indicate that the expression holds up to a multiplicative constant not depending on \(\mathbf{x}\). Using the fact that the exponential of a sum is the product of the exponentials, we obtain

\[\begin{align*} &f_{\mathbf{X}_1,\mathbf{X}_2}(\mathbf{x}_1,\mathbf{x}_2)\\ &\propto \exp\left(-\frac{1}{2}(\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu)\right)\\ &\propto \exp\left(-\frac{1}{2}(\mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2))^T (\bSigma/\bSigma_{22})^{-1} (\mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2))\right)\\ & \qquad \times \exp\left(-\frac{1}{2}(\mathbf{x}_2 - \bmu_2)^T \bSigma_{22}^{-1} (\mathbf{x}_2 - \bmu_2)\right). \end{align*}\]

4- We have shown that

\[ f_{\mathbf{X}_1|\mathbf{X}_2}(\mathbf{x}_1|\mathbf{x}_2) \propto \exp\left(-\frac{1}{2}(\mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2))^T (\bSigma/\bSigma_{22})^{-1} (\mathbf{x}_1 - \bmu_{1|2}(\mathbf{x}_2))\right) \]

and

\[ f_{\mathbf{X}_2}(\mathbf{x}_2) \propto \exp\left(-\frac{1}{2}(\mathbf{x}_2 - \bmu_2)^T \bSigma_{22}^{-1} (\mathbf{x}_2 - \bmu_2)\right). \]

In other words, the marginal density of \(\mathbf{X}_2\) is multivariate Gaussian with mean \(\bmu_2\) and covariance \(\bSigma_{22}\). The conditional density of \(\mathbf{X}_1\) given \(\mathbf{X}_2\) is multivariate Gaussian with mean \(\bmu_{1|2}(\mathbf{X}_2)\) and covariance \(\bSigma/\bSigma_{22} = \bSigma_{11} - \bSigma_{12} \bSigma_{22}^{-1} \bSigma_{12}^T\). We write this as

\[ \mathbf{X}_1|\mathbf{X}_2 \sim N_{d_1}(\bmu_{1|2}(\mathbf{X}_2), \bSigma/\bSigma_{22}) \]

and

\[ \mathbf{X}_2 \sim N_{d_2}(\bmu_2, \bSigma_{22}). \]

Similarly, by exchanging the roles of \(\mathbf{X}_1\) and \(\mathbf{X}_2\), we see that the marginal density of \(\mathbf{X}_1\) is multivariate Gaussian with mean \(\bmu_1\) and covariance \(\bSigma_{11}\). The conditional density of \(\mathbf{X}_2\) given \(\mathbf{X}_1\) is multivariate Gaussian with mean \(\bmu_{2|1}(\mathbf{X}_1) = \bmu_2 + \bSigma_{21} \bSigma_{11}^{-1} (\mathbf{X}_1 - \bmu_1)\) and covariance \(\bSigma/\bSigma_{11} = \bSigma_{22} - \bSigma_{12}^T \bSigma_{11}^{-1} \bSigma_{12}\).
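
These formulas are straightforward to implement. As an illustration (a sketch with arbitrarily chosen parameters), the following computes the mean and covariance of \(\mathbf{X}_1\) given \(\mathbf{X}_2 = \mathbf{x}_2\) for a three-dimensional Gaussian partitioned into \(d_1 = 2\) and \(d_2 = 1\) coordinates:

import numpy as np

d1 = 2
mu = np.array([1.0, -1.0, 0.5])
Sig = np.array([[2.0, 0.3, 0.5],
                [0.3, 1.0, 0.2],
                [0.5, 0.2, 1.5]])

mu1, mu2 = mu[:d1], mu[d1:]
S11, S12, S22 = Sig[:d1, :d1], Sig[:d1, d1:], Sig[d1:, d1:]

x2 = np.array([1.0])                                      # conditioning value
mu_1_given_2 = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
Sig_1_given_2 = S11 - S12 @ np.linalg.solve(S22, S12.T)   # Schur complement Sigma / Sigma_{22}

print(mu_1_given_2)
print(Sig_1_given_2)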

MULTIPLE CHOICE: Suppose \((X_1, X_2)\) is a bivariate Gaussian with mean \((0,0)\), variances \(2\) and correlation coefficient \(-1/2\). Conditioned on \(X_2 = 1\), what is the mean of \(X_1\)?

a) \(1/2\)

b) \(-1/2\)

c) \(1\)

d) \(-1\)

\(\ddagger\)

7.4.2. Kalman filter#

We consider a stochastic process \(\{\bX_t\}_{t=0}^T\) (i.e., a collection of random vectors indexed by time) with state space \(\S = \mathbb{R}^{d_0}\) of the following form

\[ \bX_{t+1} = F \,\bX_t + \bW_t \]

where the \(\bW_t\)s are i.i.d. \(N_{d_0}(\mathbf{0}, Q)\) and \(F\) and \(Q\) are known \(d_0 \times d_0\) matrices. We denote the initial state by \(\bX_0 \sim N_{d_0}(\bmu_0, \bSigma_0)\). We assume that the process \(\{\bX_t\}_{t=1}^T\) is not observed, but rather that an auxiliary observed process \(\{\bY_t\}_{t=1}^T\) with state space \(\S = \mathbb{R}^{d}\) satisfies

\[ \bY_t = H\,\bX_t + \bV_t \]

where the \(\bV_t\)s are i.i.d. \(N_d(\mathbf{0}, R)\) and \(H \in \mathbb{R}^{d \times d_0}\) and \(R \in \mathbb{R}^{d \times d}\) are known matrices. This is an example of a linear-Gaussian system (also known as linear-Gaussian state space model).

Our goal is to infer the unobserved states given the observed process. Specifically, we look at the filtering problem. Quoting Wikipedia:

The task is to compute, given the model’s parameters and a sequence of observations, the distribution over hidden states of the last latent variable at the end of the sequence, i.e. to compute P(x(t)|y(1),…,y(t)). This task is normally used when the sequence of latent variables is thought of as the underlying states that a process moves through at a sequence of points of time, with corresponding observations at each point in time. Then, it is natural to ask about the state of the process at the end.

Key lemma Given the structure of the linear-Gaussian model, the following lemma will play a key role.

LEMMA: (Linear-Gaussian System) Let \(\bW \sim N_{d}(\bmu, \bSigma)\) and \(\bW'|\bW \sim N_{d'}(A \,\bW, \bSigma')\) where \(A \in \mathbb{R}^{d' \times d}\) is a deterministic matrix, and \(\bSigma \in \mathbb{R}^{d \times d}\) and \(\bSigma' \in \mathbb{R}^{d' \times d'}\) are positive definite. Then, \((\bW, \bW')\) is multivariate Gaussian with mean vector

\[\begin{split} \bmu'' = \begin{pmatrix} \bmu \\ A\bmu \end{pmatrix} \end{split}\]

and positive definite covariance matrix

\[\begin{split} \bSigma'' = \begin{pmatrix} \bSigma & \bSigma A^T\\ A \bSigma & A \bSigma A^T + \bSigma' \end{pmatrix}. \end{split}\]

\(\flat\)

THINK-PAIR-SHARE: What are the dimensions of each block of \(\bSigma''\)? \(\ddagger\)

Proof: We have

\[\begin{align*} &f_{\bW, \bW'}(\bw, \bw')\\ &= f_{\bW}(\bw) \, f_{\bW'|\bW}(\bw'|\bw)\\ &\propto \exp\left(-\frac{1}{2}(\bw - \bmu)^T \bSigma^{-1} (\bw - \bmu)\right)\\ & \qquad \times \exp\left(-\frac{1}{2}(\bw' - A \bw)^T (\bSigma')^{-1} (\bw' - A \bw)\right)\\ &\propto \exp\left(-\frac{1}{2}\left[(\bw - \bmu)^T \bSigma^{-1} (\bw - \bmu) + (\bw' - A \bw)^T (\bSigma')^{-1} (\bw' - A \bw)\right]\right). \end{align*}\]

We re-write the quadratic function in square brackets in the exponent as follows

\[\begin{align*} &(\bw - \bmu)^T \bSigma^{-1} (\bw - \bmu) + (\bw' - A \bw)^T (\bSigma')^{-1} (\bw' - A \bw)\\ &=(\bw - \bmu)^T \bSigma^{-1} (\bw - \bmu)\\ & \qquad + ([\bw' - A\bmu] - A[\bw-\bmu])^T (\bSigma')^{-1} ([\bw' - A\bmu] - A[\bw-\bmu])\\ &=(\bw - \bmu)^T (A^T (\bSigma')^{-1} A + \bSigma^{-1}) (\bw - \bmu)\\ & \qquad - 2 (\bw-\bmu)^T A^T (\bSigma')^{-1} (\bw' - A\bmu) + (\bw' - A\bmu)^T (\bSigma')^{-1} (\bw' - A\bmu)\\ &= \begin{pmatrix} \bw - \bmu \\ \bw' - A\bmu \end{pmatrix}^T \begin{pmatrix} A^T (\bSigma')^{-1} A + \bSigma^{-1} & - A^T (\bSigma')^{-1}\\ - (\bSigma')^{-1} A & (\bSigma')^{-1} \end{pmatrix} \begin{pmatrix} \bw - \bmu \\ \bw' - A\bmu \end{pmatrix}\\ &= \begin{pmatrix} \bw - \bmu \\ \bw' - A\bmu \end{pmatrix}^T \bLambda'' \begin{pmatrix} \bw - \bmu \\ \bw' - A\bmu \end{pmatrix}, \end{align*}\]

where the last line defines \(\bLambda''\) and the second line shows that it is positive definite (why?). We have shown that \((\bW, \bW')\) is multivariate Gaussian with mean \((\bmu, A\bmu)\).

We use the Inverting a Block Matrix lemma to invert \(\bLambda''\) and reveal the covariance matrix. (One could also compute the covariance directly; try it!) We break up \(\bLambda''\) into blocks

\[\begin{split} \bLambda'' = \begin{pmatrix} \bLambda''_{11} & \bLambda''_{12}\\ (\bLambda''_{12})^T & \bLambda''_{22} \end{pmatrix} \end{split}\]

where \(\bLambda''_{11} \in \mathbb{R}^{d \times d}\), \(\bLambda''_{22} \in \mathbb{R}^{d' \times d'}\), and \(\bLambda''_{12} \in \mathbb{R}^{d \times d'}\). Recall that the inverse is

\[\begin{split} B^{-1} = \begin{pmatrix} (B/B_{22})^{-1} & - (B/B_{22})^{-1} B_{12} B_{22}^{-1}\\ - B_{22}^{-1} B_{12}^T (B/B_{22})^{-1} & B_{22}^{-1} B_{12}^T (B/B_{22})^{-1} B_{12} B_{22}^{-1} + B_{22}^{-1} \end{pmatrix} \end{split}\]

where here \(B = \bLambda''\).

The Schur complement is

\[\begin{align*} \bLambda'' / \bLambda''_{22} &= \bLambda''_{11} - \bLambda''_{12} (\bLambda''_{22})^{-1} (\bLambda''_{12})^T\\ &= A^T (\bSigma')^{-1} A + \bSigma^{-1} - (- A^T (\bSigma')^{-1}) ((\bSigma')^{-1})^{-1} (- A^T (\bSigma')^{-1})^T\\ &= A^T (\bSigma')^{-1} A + \bSigma^{-1} - A^T (\bSigma')^{-1} \bSigma' (\bSigma')^{-1} A\\ &= A^T (\bSigma')^{-1} A + \bSigma^{-1} - A^T (\bSigma')^{-1} A\\ &= \bSigma^{-1}. \end{align*}\]

Hence \((\bLambda'' / \bLambda''_{22})^{-1} = \bSigma\).

Moreover,

\[\begin{align*} - (\bLambda''/\bLambda''_{22})^{-1} \bLambda''_{12} (\bLambda''_{22})^{-1} &= - \bSigma (- A^T (\bSigma')^{-1}) ((\bSigma')^{-1})^{-1}\\ &= \bSigma A^T (\bSigma')^{-1} \bSigma'\\ &= \bSigma A^T \end{align*}\]

and

\[\begin{align*} &(\bLambda''_{22})^{-1} (\bLambda''_{12})^T (\bLambda''/\bLambda''_{22})^{-1} \bLambda''_{12} (\bLambda''_{22})^{-1} + (\bLambda''_{22})^{-1}\\ &= ((\bSigma')^{-1})^{-1} (- A^T (\bSigma')^{-1})^T \bSigma (-A^T (\bSigma')^{-1}) ((\bSigma')^{-1})^{-1} + ((\bSigma')^{-1})^{-1}\\ &= \bSigma' (\bSigma')^{-1} A \bSigma A^T (\bSigma')^{-1} \bSigma' + \bSigma'\\ &= A \bSigma A^T + \bSigma'. \end{align*}\]

Plugging back into the formula for the inverse of a block matrix completes the proof. \(\square\)
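
One can also check the lemma empirically (a sketch; the parameters below are arbitrary): sample many pairs \((\bW, \bW')\) and compare the empirical mean and covariance of the stacked vector to \(\bmu''\) and \(\bSigma''\).

import numpy as np

rng = np.random.default_rng(6)
d, dp = 2, 3
mu = np.array([1.0, -2.0])
Sig = np.array([[1.0, 0.3], [0.3, 2.0]])
A = rng.normal(size=(dp, d))
Sigp = np.diag([0.5, 1.0, 1.5])

n = 200000
W = rng.multivariate_normal(mu, Sig, size=n)                         # W ~ N(mu, Sig)
Wp = W @ A.T + rng.multivariate_normal(np.zeros(dp), Sigp, size=n)   # W'|W ~ N(AW, Sig')

Z = np.hstack([W, Wp])
Sig_theory = np.block([[Sig, Sig @ A.T],
                       [A @ Sig, A @ Sig @ A.T + Sigp]])

print(np.round(Z.mean(axis=0) - np.concatenate([mu, A @ mu]), 2))    # approximately zero
print(np.abs(np.cov(Z, rowvar=False) - Sig_theory).max())            # small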

Joint distribution We extend our previous notation as follows: for two disjoint, finite collections of random vectors \(\{\mathbf{U}_i\}_i\) and \(\{\mathbf{W}_j\}_j\) defined on the same probability space, we let \(f_{\{\mathbf{U}_i\}_i, \{\mathbf{W}_j\}_j}\) be their joint density, \(f_{\{\mathbf{U}_i\}_i | \{\mathbf{W}_j\}_j}\) be the conditional density of \(\{\mathbf{U}_i\}_i\) given \(\{\mathbf{W}_j\}_j\), and \(f_{\{\mathbf{U}_i\}_i}\) be the marginal density of \(\{\mathbf{U}_i\}_i\). Formally, our goal is to compute

\[ f_{\bX_{t}|\bY_{1:t}} := f_{\{\bX_{t}\}|\{\bY_1,\ldots,\bY_{t}\}} \]

recursively in \(t\), where we define \(\bY_{1:t} = \{\bY_1,\ldots,\bY_{t}\}\). We will see that all densities appearing in this calculation are multivariate Gaussians, and therefore keeping track of the means and covariances will suffice.

Formally, we posit that the density of the full process (with both observed and unobserved parts) is

\[ f_{\bX_0}(\bx_0) \prod_{t=1}^{T} f_{\bX_{t}|\bX_{t-1}}(\bx_{t}|\bx_{t-1}) f_{\bY_{t}|\bX_{t}}(\by_{t}|\bx_{t}) \]

where the description of the model stipulates that

\[ \bX_{t}|\bX_{t-1} \sim N_{d_0}(F \,\bX_{t-1}, Q). \]

and

\[ \bY_t|\bX_t \sim N_d(H\,\bX_t, R) \]

for all \(t \geq 1\) and \(\bX_0 \sim N_{d_0}(\bmu_0, \bSigma_0)\). We assume that \(\bmu_0\) and \(\bSigma_0\) are known. Applying the Linear-Gaussian System lemma inductively shows that the full process \((\bX_{0:T}, \bY_{1:T})\) is jointly multivariate Gaussian (try it!). In particular, all marginals and conditionals are multivariate Gaussians.

Graphically, the model can be represented as follows, where, as before, each variable is a node and its conditional distribution depends only on its parent nodes.

Figure: Linear-Gaussian system (Source)

HMM

\(\bowtie\)

Conditional independence The stipulated form of the joint density of the full process implies many conditional independence relations. We will need the following two:

1- \(\bX_t\) is conditionally independent of \(\bY_{1:t-1}\) given \(\bX_{t-1}\)

2- \(\bY_t\) is conditionally independent of \(\bY_{1:t-1}\) given \(\bX_{t}\).

We prove the first one and leave the second one as an exercise. In fact, we prove something stronger: \(\bX_t\) is conditionally independent of \(\bX_{0:t-2}, \bY_{1:t-1}\) given \(\bX_{t-1}\).

First, by integrating over \(\by_T\), then \(\bx_{T}\), then \(\by_{T-1}\), then \(\bx_{T-1}\), … , then \(\by_t\), we see that the joint density of \((\bX_{0:t}, \bY_{1:t-1})\) is

\[ f_{\bX_0}(\bx_0) \left(\prod_{s=1}^{t-1} f_{\bX_{s}|\bX_{s-1}}(\bx_{s}|\bx_{s-1}) f_{\bY_{s}|\bX_{s}}(\by_{s}|\bx_{s})\right) f_{\bX_{t}|\bX_{t-1}}(\bx_{t}|\bx_{t-1}). \]

Similarly, the joint density of \((\bX_{0:t-1}, \bY_{1:t-1})\) is

\[ f_{\bX_0}(\bx_0) \left(\prod_{s=1}^{t-1} f_{\bX_{s}|\bX_{s-1}}(\bx_{s}|\bx_{s-1}) f_{\bY_{s}|\bX_{s}}(\by_{s}|\bx_{s})\right). \]

The conditional density given \(\bX_{t-1}\) is then obtained by dividing the first expression by the marginal density of \(\bX_{t-1}\)

\[\begin{align*} &\frac{f_{\bX_0}(\bx_0) \left(\prod_{s=1}^{t-1} f_{\bX_{s}|\bX_{s-1}}(\bx_{s}|\bx_{s-1}) f_{\bY_{s}|\bX_{s}}(\by_{s}|\bx_{s})\right) f_{\bX_{t}|\bX_{t-1}}(\bx_{t}|\bx_{t-1})}{f_{\bX_{t-1}}(\bx_{t-1})}\\ &= \frac{f_{\bX_0}(\bx_0) \left(\prod_{s=1}^{t-1} f_{\bX_{s}|\bX_{s-1}}(\bx_{s}|\bx_{s-1}) f_{\bY_{s}|\bX_{s}}(\by_{s}|\bx_{s})\right) }{f_{\bX_{t-1}}(\bx_{t-1})}f_{\bX_{t}|\bX_{t-1}}(\bx_{t}|\bx_{t-1})\\ &= f_{\bX_{0:t-2}, \bY_{1:t-1}|\bX_{t-1}}(\bx_{0:t-2}, \by_{1:t-1}|\bx_{t-1}) f_{\bX_{t}|\bX_{t-1}}(\bx_{t}|\bx_{t-1}) \end{align*}\]

as claimed.

Algorithm We give a recursive algorithm for solving the filtering problem, that is, for computing the mean vector and covariance matrix of the conditional density \(f_{\bX_{t}|\bY_{1:t}}\).

Initial step: We first compute \(f_{\bX_1|\bY_1}\). We do this through a series of observations that will generalize straightforwardly.

1- We have \(\bX_0 \sim N_{d_0}(\bmu_0, \bSigma_0)\) and \(\bX_{1}|\bX_{0} \sim N_{d_0}(F \,\bX_0, Q)\). So, by the Linear-Gaussian System lemma, the joint vector \((\bX_0, \bX_{1})\) is multivariate Gaussian with mean vector \((\bmu_0, F \bmu_0)\) and covariance matrix

\[\begin{split} \begin{pmatrix} \bSigma_0 & \bSigma_0 F^T\\ F \bSigma_0 & F \bSigma_0 F^T + Q \end{pmatrix}. \end{split}\]

Hence the marginal density of \(\bX_{1}\) is multivariate Gaussian with mean vector \(F \bmu_0\) and covariance matrix

\[ P_0 := F \bSigma_0 F^T + Q. \]

2- Combining the previous observation about the marginal density of \(\bX_{1}\) with the fact that \(\bY_1|\bX_1 \sim N_d(H\,\bX_1, R)\), the Linear-Gaussian System lemma says that \((\bX_1, \bY_{1})\) is multivariate Gaussian with mean vector \((F \bmu_0, H F \bmu_0)\) and covariance matrix

\[\begin{split} \begin{pmatrix} P_0 & P_0 H^T\\ H P_0 & H P_0 H^T + R \end{pmatrix}. \end{split}\]

Finally, define \(K_1 := P_0 H^T (H P_0 H^T + R)^{-1}\). This new observation and the conditional density formula give that

\[ \bX_1|\bY_1 \sim N_{d_0}(\bmu_1, \bSigma_1) \]

where we define

\[\begin{align*} \bmu_1 &:= F \bmu_0 + P_0 H^T (H P_0 H^T + R)^{-1} (\mathbf{Y}_1 - H F \bmu_0)\\ &= F \bmu_0 + K_1 (\mathbf{Y}_1 - H F \bmu_0) \end{align*}\]

and

\[\begin{align*} \bSigma_1 &:= P_0 - P_0 H^T (H P_0 H^T + R)^{-1} H P_0\\ &= (I_{d_0 \times d_0} - K_1 H) P_0. \end{align*}\]

General step: Assuming by induction that \(\bX_{t-1}|\bY_{1:t-1} \sim N_{d_0}(\bmu_{t-1}, \bSigma_{t-1})\) where \(\bmu_{t-1}\) depends implicitly on \(\bY_{1:t-1}\) (but \(\bSigma_{t-1}\) does not), we deduce the next step. It mimics closely the initial step.

1- Predict: We first “predict” \(\bX_{t}\) given \(\bY_{1:t-1}\). We use the fact that \(\bX_{t-1}|\bY_{1:t-1} \sim N_{d_0}(\bmu_{t-1}, \bSigma_{t-1})\). Moreover, we have that \(\bX_{t}|\bX_{t-1} \sim N_{d_0}(F \,\bX_{t-1}, Q)\) and that \(\bX_t\) is conditionally independent of \(\bY_{1:t-1}\) given \(\bX_{t-1}\). So, \(\bX_{t}|\{\bX_{t-1}, \bY_{1:t-1}\} \sim N_{d_0}(F \,\bX_{t-1}, Q)\) by the Role of Independence lemma. By the Linear-Gaussian System lemma, the joint vector \((\bX_{t-1}, \bX_{t})\) conditioned on \(\bY_{1:t-1}\) is multivariate Gaussian with mean vector \((\bmu_{t-1}, F \bmu_{t-1})\) and covariance matrix

\[\begin{split} \begin{pmatrix} \bSigma_{t-1} & \bSigma_{t-1} F^T\\ F \bSigma_{t-1} & F \bSigma_{t-1} F^T + Q \end{pmatrix}. \end{split}\]

As a consequence, the conditional marginal density of \(\bX_{t}\) given \(\bY_{1:t-1}\) is multivariate Gaussian with mean vector \(F \bmu_{t-1}\) and covariance matrix

\[ P_{t-1} := F \bSigma_{t-1} F^T + Q. \]

2- Update: Next we “update” our prediction of \(\bX_{t}\) using the new observation \(\bY_{t}\). We have that \(\bY_{t}|\bX_{t} \sim N_d(H\,\bX_{t}, R)\) and that \(\bY_t\) is conditionally independent of \(\bY_{1:t-1}\) given \(\bX_{t}\). So \(\bY_{t}|\{\bX_{t}, \bY_{1:t-1}\} \sim N_d(H\,\bX_{t}, R)\) by the Role of Independence lemma. Combining this with the previous observation, the Linear-Gaussian System lemma says that \((\bX_{t}, \bY_{t})\) given \(\bY_{1:t-1}\) is multivariate Gaussian with mean vector \((F \bmu_{t-1}, H F \bmu_{t-1})\) and covariance matrix

\[\begin{split} \begin{pmatrix} P_{t-1} & P_{t-1} H^T\\ H P_{t-1} & H P_{t-1} H^T + R \end{pmatrix}. \end{split}\]

Finally, define \(K_{t} := P_{t-1} H^T (H P_{t-1} H^T + R)^{-1}\). This new observation and the conditional density formula give that

\[ \bX_{t}|\{\bY_t, \bY_{1:t-1}\} = \bX_{t}|\bY_{1:t} \sim N_{d_0}(\bmu_{t}, \bSigma_{t}) \]

where we define

\[\begin{align*} \bmu_{t} &:= F \bmu_{t-1} + K_{t} (\mathbf{Y}_{t} - H F \bmu_{t-1}) \end{align*}\]

and

\[\begin{align*} \bSigma_{t} &= (I_{d_0 \times d_0} - K_{t} H) P_{t-1}. \end{align*}\]

Summary: Let \(\bmu_t\) and \(\bSigma_t\) be the mean and covariance matrix of \(\bX_t\) conditioned on \(\bY_{1:t}\). The recursions for these quantities are the following:

\[\begin{align*} \bmu_t &= F\,\bmu_{t-1} + K_{t} (\bY_{t} - H F \bmu_{t-1})\\ \bSigma_t &= (I_{d_0 \times d_0} - K_t H) P_{t-1} \end{align*}\]

where

\[\begin{align*} P_{t-1} &= F \,\bSigma_{t-1} F^T + Q\\ K_t &= P_{t-1} H^T (H P_{t-1} H^T + R)^{-1} \end{align*}\]

This last matrix is known as the Kalman gain matrix. The solution above is known as Kalman filtering.

7.4.3. Location tracking#

We apply Kalman filtering to location tracking. Returning to our cyborg corgi example, we imagine that we get noisy observations about its successive positions in a park. (Think of GPS measurements.) We seek to get a better estimate of its location using the method above. See for example this blog post on location tracking.

Figure: Cyborg corgi (Credit: Made with Midjourney)

Cyborg corgi

\(\bowtie\)

We model the true location as a linear-Gaussian system over the 2D position \((z_{1,t}, z_{2,t})_t\) and velocity \((\dot{z}_{1,t}, \dot{z}_{2,t})_t\), sampled at time intervals of length \(\Delta\). Formally,

\[\begin{split} \bX_t = (z_{1,t}, z_{2,t}, \dot{z}_{1,t}, \dot{z}_{2,t}), \quad F = \begin{pmatrix} 1 & 0 & \Delta & 0\\ 0 & 1 & 0 & \Delta\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}, \end{split}\]

so the unobserved dynamics are

\[\begin{split} \begin{pmatrix} z_{1,t+1}\\ z_{2,t+1}\\ \dot{z}_{1,t+1}\\ \dot{z}_{2,t+1} \end{pmatrix} = \bX_{t+1} = F \,\bX_t + \bW_t = \begin{pmatrix} z_{1,t} + \Delta \dot{z}_{1,t} + W_{1,t}\\ z_{2,t} + \Delta \dot{z}_{2,t} + W_{2,t}\\ \dot{z}_{1,t} + \dot{W}_{1,t}\\ \dot{z}_{2,t} + \dot{W}_{2,t} \end{pmatrix} \end{split}\]

where the \(\bW_t = (W_{1,t}, W_{2,t}, \dot{W}_{1,t}, \dot{W}_{2,t}) \sim N_{d_0}(\mathbf{0}, Q)\) with \(Q\) known.

In words, the velocity is unchanged, up to Gaussian perturbation. The position changes proportionally to the velocity in the corresponding dimension.

The observations \((\tilde{z}_{1,t}, \tilde{z}_{2,t})_t\) are modeled as

\[\begin{split} \bY_t = (\tilde{z}_{1,t}, \tilde{z}_{2,t}), \quad H = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{pmatrix}, \end{split}\]

so the observed process satisfies

\[\begin{split} \begin{pmatrix} \tilde{z}_{1,t}\\ \tilde{z}_{2,t} \end{pmatrix} = \bY_t = H\,\bX_t + \bV_t = \begin{pmatrix} z_{1,t} + \tilde{V}_{1,t}\\ z_{2,t} + \tilde{V}_{2,t} \end{pmatrix} \end{split}\]

where the \(\bV_t = (\tilde{V}_{1,t}, \tilde{V}_{2,t}) \sim N_d(\mathbf{0}, R)\) with \(R\) known.

In words, we only observe the positions, up to Gaussian noise.

Implementing the Kalman filter We implement the Kalman filter as described above with known covariance matrices. We take \(\Delta = 1\) for simplicity. The code is adapted from [Mur].

We will test Kalman filtering on a simulated path drawn from the linear-Gaussian model above. The following function creates such a path and its noisy observations.

import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

seed = 535
rng = np.random.default_rng(seed)
def lgSamplePath(ss, os, F, H, Q, R, init_mu, init_Sig, T):
    x = np.zeros((ss,T))  # unobserved states
    y = np.zeros((os,T))  # noisy observations

    x[:,0] = rng.multivariate_normal(init_mu, init_Sig)
    for t in range(1,T):
        x[:,t] = rng.multivariate_normal(F @ x[:,t-1], Q)
        y[:,t] = rng.multivariate_normal(H @ x[:,t], R)
    
    return x, y

Here is an example. In the plots, the red dots are the noisy observations and the green crosses are the unobserved true path.

ss = 4 # state size
os = 2 # observation size
F = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]]) 
H = np.array([[1., 0., 0., 0.],
              [0., 1, 0., 0.]])
Q = 0.1 * np.diag(np.ones(ss))
R = 10 * np.diag(np.ones(os))
init_mu = np.array([0., 0., 1., 1.])
init_Sig = 1 * np.diag(np.ones(ss))
T = 50
x, y = lgSamplePath(ss, os, F, H, Q, R, init_mu, init_Sig, T)

In the next plot (and throughout this section), the red dots are the noisy observations.

plt.plot(y[0,:], y[1,:], 
         marker='o', c='r', linestyle='dotted')
plt.xlim((np.min(y[0,:])-5, np.max(y[0,:])+5)) 
plt.ylim((np.min(y[1,:])-5, np.max(y[1,:])+5))
plt.show()

In the next plot (and throughout this section), the green crosses are the unobserved true path.

plt.plot(x[0,:], x[1,:], 
         marker='x', c='g', linestyle='dashed')
plt.xlim((np.min(x[0,:])-5, np.max(x[0,:])+5)) 
plt.ylim((np.min(x[1,:])-5, np.max(x[1,:])+5))
plt.scatter(y[0,:], y[1,:], 
            c='r', alpha=0.5)
plt.show()

The following function implements the Kalman filter. The full recursion is broken up into several steps. We use numpy.linalg.inv to compute the Kalman gain matrix. Below, mu_pred is \(F \bmu_{t-1}\) and Sig_pred is \(P_{t-1} = F \bSigma_{t-1} F^T + Q\), which are the mean vector and covariance matrix of \(\bX_{t}\) given \(\bY_{1:t-1}\) as computed in the Predict step.

def kalmanUpdate(ss, F, H, Q, R, y_t, mu_prev, Sig_prev):
    # predict step: mean and covariance of X_t given Y_{1:t-1}
    mu_pred = F @ mu_prev
    Sig_pred = F @ Sig_prev @ F.T + Q
    # update step: incorporate the new observation y_t
    e_t = y_t - H @ mu_pred
    S = H @ Sig_pred @ H.T + R
    Sinv = LA.inv(S)
    K = Sig_pred @ H.T @ Sinv  # Kalman gain
    mu_new = mu_pred + K @ e_t
    Sig_new = (np.diag(np.ones(ss)) - K @ H) @ Sig_pred
    return mu_new, Sig_new
def kalmanFilter(ss, os, y, F, H, Q, R, init_mu, init_Sig, T):
    mu = np.zeros((ss, T))
    Sig = np.zeros((ss, ss, T))
    mu[:,0] = init_mu
    Sig[:,:,0] = init_Sig

    for t in range(1,T):
        mu[:,t], Sig[:,:,t] = kalmanUpdate(ss, F, H, Q, R, 
                                           y[:,t], mu[:,t-1], 
                                           Sig[:,:,t-1])

    return mu, Sig

We apply this to the location tracking example. The inferred states, or more precisely their estimated means, are in blue. Note that we also inferred the velocity at each time point, but we are not plotting that information.

init_mu = np.array([0., 0., 1., 1.])
init_Sig = 1 * np.diag(np.ones(ss))
mu, Sig = kalmanFilter(ss, os, y, F, H, Q, R, init_mu, init_Sig, T)
plt.scatter(x[0,:], x[1,:], 
            marker='x', c='g', alpha=0.5)
plt.xlim((np.min(x[0,:])-5, np.max(x[0,:])+5)) 
plt.ylim((np.min(x[1,:])-5, np.max(x[1,:])+5))
plt.scatter(y[0,:], y[1,:], 
            c='r', alpha=0.5)
plt.plot(mu[0,:], mu[1,:], 
         c='b', marker='s', linewidth=1)
plt.show()

To quantify the improvement of the inferred means over the raw observations, we compute the overall error in both cases (specifically, the Euclidean norm of the deviations from the true first position coordinate across all times).

dobs = x[0:1,:] - y[0:1,:]
mse_obs = np.sqrt(np.sum(dobs**2))
print(mse_obs)
22.891982252201856
dfilt = x[0:1,:] - mu[0:1,:]
mse_filt = np.sqrt(np.sum(dfilt**2))
print(mse_filt)
9.778610100463018

We indeed observe a substantial reduction.

Missing data We can also allow for the possibility that some observations are missing. Imagine, for instance, losing the GPS signal while going through a tunnel. The recursions above remain valid, with the only modification being that the Update equations involving \(\bY_t\) are dropped at those times \(t\) where there is no observation. In NumPy, we can use NaN to indicate a missing observation. (Alternatively, one can use the numpy.ma module.)

We generate a new sample path, with slightly modified parameters, and mask the observations at times \(t=10,\ldots,19\).

ss = 4
os = 2
F = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]]) 
H = np.array([[1., 0., 0., 0.],
              [0., 1, 0., 0.]])
Q = 0.01 * np.diag(np.ones(ss))
R = 10 * np.diag(np.ones(os))
init_mu = np.array([0., 0., 1., 1.])
init_Sig = Q
T = 30
x, y = lgSamplePath(ss, os, F, H, Q, R, init_mu, init_Sig, T)
for i in range(10,20):
    y[0,i] = np.nan
    y[1,i] = np.nan

Here is the sample path we are aiming to infer.

plt.plot(x[0,:], x[1,:], 
         marker='x', c='g', linestyle='dashed', alpha=0.5)
plt.xlim((np.min(x[0,:])-5, np.max(x[0,:])+5)) 
plt.ylim((np.min(x[1,:])-5, np.max(x[1,:])+5))
plt.plot(y[0,:], y[1,:], 
         c='r', marker='o', linestyle='dotted')
plt.show()

We modify the recursion accordingly, that is, skip the Update step when there is no observation to use for the update.

def kalmanUpdate(ss, F, H, Q, R, y_t, mu_prev, Sig_prev):
    # predict step: mean and covariance of X_t given Y_{1:t-1}
    mu_pred = F @ mu_prev
    Sig_pred = F @ Sig_prev @ F.T + Q
    if np.isnan(y_t[0]) or np.isnan(y_t[1]):
        # missing observation: skip the update step
        return mu_pred, Sig_pred
    else:
        # update step: incorporate the new observation y_t
        e_t = y_t - H @ mu_pred
        S = H @ Sig_pred @ H.T + R
        Sinv = LA.inv(S)
        K = Sig_pred @ H.T @ Sinv  # Kalman gain
        mu_new = mu_pred + K @ e_t
        Sig_new = (np.diag(np.ones(ss)) - K @ H) @ Sig_pred
        return mu_new, Sig_new
init_mu = np.array([0., 0., 1., 1.])
init_Sig = 1 * np.diag(np.ones(ss))
mu, Sig = kalmanFilter(ss, os, y, F, H, Q, R, init_mu, init_Sig, T)
plt.scatter(x[0,:], x[1,:], 
         marker='x', c='g', alpha=0.5)
plt.xlim((np.min(x[0,:])-5, np.max(x[0,:])+5)) 
plt.ylim((np.min(x[1,:])-5, np.max(x[1,:])+5))
plt.scatter(y[0,:], y[1,:], 
            c='r', alpha=0.5)
plt.plot(mu[0,:], mu[1,:], 
         marker='s', linewidth=1)
plt.show()