
\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

7.4. Limit behavior 2: convergence to equilibrium#

We continue our study of the long-term behavior of a chain. Again, we restrict ourselves to finite-space discrete-time Markov chains that are also time-homogeneous.

7.4.1. Definitions#

Now that we have established the existence and uniqueness of a stationary distribution (at least in the irreducible case), it remains to justify its relevance. As we indicated before, the fixed-point nature of the stationary distribution definition suggests that it arises as a limit of repeatedly applying \(P\). Indeed it can be shown that, starting from any distribution, the state distribution at time \(t\) converges to the stationary distribution as \(t \to +\infty\) – under an additional assumption.

The additional assumption involves periodicity. The following example illustrates the issue.

EXAMPLE: (A Periodic Chain) Consider a two-state Markov chain with transition matrix

\[\begin{split} P =\begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}. \end{split}\]

Note that this chain is irreducible. Its unique stationary distribution is \(\bpi = (1/2, 1/2)\) since indeed

\[\begin{split} \bpi P = (1/2, 1/2) \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix} = (1/2, 1/2) = \bpi. \end{split}\]

We compute the distribution at time \(t\), started from state \(1\). That is,

\[ \bmu = (1, 0). \]

Then

\[ \bmu P = (0, 1), \]
\[ \bmu P^2 = (1,0), \]
\[ \bmu P^3 = (0,1), \]

and so on.

In general, by induction,

\[\begin{split} \bmu P^k = \begin{cases} (1,0) & \text{if $k$ is even}\\ (0,1) & \text{if $k$ is odd}. \end{cases} \end{split}\]

Clearly the distribution does not converge. Note, however, that the time average does converge:

\[ \lim_{t \to +\infty} \frac{1}{t}\sum_{k \leq t} \bmu P^k = (1/2, 1/2) = \bpi. \]

\(\lhd\)
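This oscillation is easy to verify numerically, as in the following quick sketch (we import numpy as np here, matching the convention of the numerical sections below):

import numpy as np

P = np.array([[0., 1.],
              [1., 0.]])
mu = np.array([1., 0.])  # start from state 1
for k in range(1, 5):
    mu = mu @ P
    print(mu)  # alternates between (0, 1) and (1, 0)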

We will need the following definition.

DEFINITION (Aperiodic) \(\idx{aperiodic}\xdi\) Let \((X_t)_{t \geq 0}\) be a Markov chain over a finite state space \(\mathcal{S}\). We say that state \(i \in \mathcal{S}\) is aperiodic if

\[ \P[X_t = i\,|\,X_0 = i] > 0, \]

for all sufficiently large \(t\). A chain is aperiodic if all its states are. \(\natural\)

EXAMPLE: (A Periodic Chain, continued) Going back to the two-state chain above, we note that neither state is aperiodic. For state \(1\), we have shown that

\[ \P[X_t = 1\,|\, X_0 = 1] = 0 \]

for all \(t\) odd. \(\lhd\)

We will not need to explore this definition in great detail here. Instead, we give a simple sufficient condition that is typically good enough for data science applications.

DEFINITION (Lazy) \(\idx{lazy}\xdi\) We say that a Markov chain \((X_t)_{t \geq 0}\) is lazy if every state \(i\) satisfies

\[ \P[X_1 = i\,|\,X_0 = i] > 0. \]

Put differently, all entries on the diagonal of the transition matrix are strictly positive. \(\natural\)

Graphically, the chain has self-loops\(\idx{self-loop}\xdi\) on each vertex. This terminology (which is not entirely standard) emphasizes the idea that the chain can “lazily” stay in its current state rather than always transitioning to a different one.
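In code, this condition is a one-line check (a minimal sketch, assuming numpy has been imported as np):

def is_lazy(P):
    # lazy: every diagonal entry of the transition matrix is strictly positive
    return bool(np.all(np.diag(P) > 0))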

We check that a lazy chain is necessarily aperiodic. For any state \(i \in \mathcal{S}\) and any \(t\)

\[ \P[X_t = i\,|\,X_0 = i] \geq \prod_{s = 1}^t \P[X_{s} = i\,|\,X_{s-1} = i] = (\P[X_1 = i\,|\,X_0 = i])^t > 0, \]

by assumption. Here the inequality holds because never leaving \(i\) is only one way of being at \(i\) at time \(t\) (the product form follows from the Markov property), and the equality uses time-homogeneity. In words, the probability of being at \(i\) at time \(t\) given that we started at \(i\) is at least the probability that we never left.

EXAMPLE: Any Markov chain can be modified to be lazy by adding self-loops to all vertices. Specifically, let \(P = (p_{i,j})_{i,j} \in \mathbb{R}^{n \times n}\) be the transition matrix of a Markov chain on \([n]\). Consider the modified transition matrix \(Q = (q_{i,j})_{i,j}\) defined as

\[ Q = \frac{1}{2}(I_{n \times n} + P). \]

We check that this is indeed a stochastic matrix. For each \(i, j \in [n]\), we have

\[ q_{i,j} = \frac{1}{2} \mathbf{1}_{i = j} + \frac{1}{2} p_{i,j} \geq \frac{1}{2} p_{i,j} \geq 0 \]

and

\[ \sum_{\ell=1}^n q_{i,\ell} = \sum_{\ell=1}^n \left[\frac{1}{2} \mathbf{1}_{i = \ell} + \frac{1}{2} p_{i,\ell}\right] = \frac{1}{2} \sum_{\ell=1}^n \mathbf{1}_{i = \ell} + \frac{1}{2} \sum_{\ell=1}^n p_{i,\ell} = \frac{1}{2} + \frac{1}{2} = 1. \]

Finally, we note that \(Q\) is lazy since

\[ q_{i,i} = \frac{1}{2} \mathbf{1}_{i = i} + \frac{1}{2} p_{i,i} = \frac{1}{2} + \frac{1}{2} p_{i,i} \geq \frac{1}{2} > 0. \]

The chain \(Q\) is often referred to as the “lazy version” of \(P\). \(\lhd\)
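As an illustration (a sketch, again assuming numpy as np), here is the lazy version of the two-state periodic chain from the earlier example:

P = np.array([[0., 1.],
              [1., 0.]])
Q = 0.5 * (np.eye(2) + P)      # lazy version: Q = (I + P)/2
print(Q)                       # both diagonal entries equal 1/2 > 0, so Q is lazy
print(np.array([1., 0.]) @ Q)  # one step from state 1 already gives (1/2, 1/2)

Unlike the powers of \(P\), the powers of \(Q\) do not oscillate; in this small example they in fact reach the stationary distribution in a single step.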

7.4.2. Convergence theorems#

We are ready to state two key theorems.

First, the Convergence to Equilibrium Theorem applies to irreducible and aperiodic chains. Such chains have a unique stationary distribution. The theorem says that, started from any initial distribution, the distribution at time \(t\) converges to the stationary distribution as \(t \to +\infty\).

THEOREM (Convergence to Equilibrium) \(\idx{convergence to equilibrium theorem}\xdi\) Let \((X_t)_{t\geq 0}\) be an irreducible and aperiodic Markov chain with unique stationary distribution \(\bpi\). Then for any initial distribution \(\bmu\) and any state \(i\)

\[ \P[X_t = i] \to \pi_i, \]

as \(t \to +\infty\). \(\sharp\)

This theorem should look familiar. Using the formula \((\bmu P^t)_i\) for the probability that the chain is in state \(i\) at time \(t\), we can rephrase it in matrix form as

\[ \bmu P^t \to \bpi, \]

as \(t \to +\infty\). This is highly reminiscent of the Power Iteration Lemma: repeated multiplication by \(P\), starting from an arbitrary distribution, converges to \(\bpi\), which (recall) is a left eigenvector of \(P\) with eigenvalue \(1\). Unlike in the Power Iteration Lemma, there is no need to normalize here: we implicitly work with the \(\ell_1\)-norm, and multiplication by the stochastic matrix \(P\) preserves the \(\ell_1\)-norm of a probability vector.

Second, the Ergodic Theorem applies to irreducible (but not necessarily aperiodic) chains. Again, such chains have a unique stationary distribution. The theorem says that, started from any initial distribution, the frequency of visits to any state \(i\) converges to its stationary probability \(\pi_i\). Below, we use the notation \(\mathbf{1}[X_s = i]\), which is \(1\) when \(X_s = i\) and \(0\) otherwise.

THEOREM (Ergodic) \(\idx{ergodic theorem}\xdi\) Let \((X_t)_{t\geq 0}\) be an irreducible Markov chain with unique stationary distribution \(\bpi\). Then for any initial distribution \(\bmu\) and any state \(i\)

\[ \frac{1}{t} \sum_{s = 0}^t \mathbf{1}[X_s = i] \to \pi_i, \]

as \(t \to +\infty\). \(\sharp\)

Above, \(\sum_{s = 0}^t \mathbf{1}[X_s = i]\) is the number of visits to \(i\) up to time \(t\).

Note that neither of these two theorems immediately implies the other. They are both useful in their own way.

NUMERICAL CORNER: The Convergence to Equilibrium Theorem implies that we can use power iteration to compute the unique stationary distribution in the irreducible case. We revisit the Robot Vacuum Example. We initialize with the uniform distribution, then repeatedly multiply by \(P\).

import numpy as np
from numpy import linalg as LA
import mmids  # companion module of the book, used below for sampling

P_robot = np.array([[0, 0.8, 0, 0.2, 0, 0, 0, 0, 0],
                    [0.3, 0, 0.2, 0, 0, 0.5, 0, 0, 0],
                    [0, 0.6, 0, 0, 0, 0.4, 0, 0, 0],
                    [0.1, 0.1, 0, 0, 0.8, 0, 0, 0, 0],
                    [0, 0, 0, 0.25, 0, 0, 0.75, 0, 0],
                    [0, 0.15, 0.15, 0, 0, 0, 0, 0.35, 0.35],
                    [0, 0, 0, 0, 0, 0, 0, 1, 0],
                    [0, 0, 0, 0, 0.3, 0.4, 0.2, 0, 0.1],
                    [0, 0, 0, 0, 0, 1, 0, 0, 0]])
n_robot = P_robot.shape[0]
mu = np.ones(n_robot)/n_robot
print(mu)
[0.11111111 0.11111111 0.11111111 0.11111111 0.11111111 0.11111111
 0.11111111 0.11111111 0.11111111]
mu = mu @ P_robot
print(mu)
[0.04444444 0.18333333 0.03888889 0.05       0.12222222 0.25555556
 0.10555556 0.15       0.05      ]
mu = mu @ P_robot
print(mu)
[0.06       0.10222222 0.075      0.03944444 0.085      0.21722222
 0.12166667 0.195      0.10444444]

We repeat, say, \(10\) more times and compare to the true stationary distribution pi_robot, computed below from an eigendecomposition of \(P^T\).

for _ in range(10):
    mu = mu @ P_robot
print(mu)
[0.0358112  0.10982018 0.06297235 0.02721311 0.08055026 0.27393441
 0.09944157 0.19521946 0.11503747]
w, v = LA.eig(P_robot.T)
idx = np.argmin(np.abs(w - 1))  # pick the eigenvalue closest to 1 (eig does not sort)
pi_robot = np.real(v[:,idx]) / np.sum(np.real(v[:,idx]))
print(pi_robot)
[0.03581295 0.11029771 0.06311453 0.02723642 0.0802953  0.27369992
 0.09922559 0.19502056 0.11529703]

We see that a small number of iterations sufficed to get an accurate answer. In general, the speed of convergence depends on the eigenvalues of \(P\) that are strictly smaller than \(1\) in absolute value.
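To get a rough sense of the convergence speed here, we can sort the eigenvalue moduli computed above (a quick check reusing w from the previous cell); the closer the second-largest modulus is to \(1\), the slower the convergence:

print(np.sort(np.abs(w))[::-1])  # largest modulus is 1; the second largest governs the rate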

We can also check the Ergodic Theorem through simulation. We generate a long sample path and compare the state visit frequencies to pi_robot.

seed = 535
rng = np.random.default_rng(seed)

mu = np.ones(n_robot) / n_robot
path_length = 50000
visit_freq = np.zeros(n_robot)

path = mmids.SamplePath(rng, mu, P_robot, path_length)
for i in range(n_robot):
    # states are labeled 1 through n_robot; the path has path_length + 1 entries
    visit_freq[i] = np.count_nonzero(path == i+1)/(path_length+1)

print(visit_freq)
[0.03627927 0.10927781 0.0601788  0.02645947 0.08045839 0.27359453
 0.10119798 0.19693606 0.11561769]
print(pi_robot)
[0.03581295 0.11029771 0.06311453 0.02723642 0.0802953  0.27369992
 0.09922559 0.19502056 0.11529703]

\(\unlhd\)

CHAT & LEARN The mixing time is an important quantity in the study of Markov chains. Ask your favorite AI chatbot to define this concept and discuss its relevance to the convergence of Markov chains. Explore some bounds on the mixing time for specific classes of Markov chains. \(\ddagger\)

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 Which of the following is a sufficient condition for a Markov chain to be aperiodic?

a) The chain is irreducible.

b) The chain has a unique stationary distribution.

c) The chain is lazy.

d) The chain has a finite state space.

2 In a Markov chain with state space \(S\) and transition matrix \(P\), what does it mean if the chain is lazy?

a) \(\mathbb{E}[X_t] = 0\) for all \(t\).

b) \(P_{ii} > 0\) for all \(i \in S\).

c) \(P_{ij} = 0\) for all \(i \ne j\).

d) \(\mathrm{Var}(X_t) = 1\) for all \(t\).

3 The Convergence to Equilibrium Theorem applies to which type of Markov chains?

a) Irreducible and aperiodic chains

b) Irreducible and periodic chains

c) Reducible and aperiodic chains

d) Reducible and periodic chains

4 The Ergodic Theorem applies to which type of Markov chains?

a) Irreducible chains

b) Aperiodic chains

c) Reducible chains

d) Periodic chains

5 Which of the following describes a key difference between the Convergence to Equilibrium Theorem and the Ergodic Theorem?

a) The Convergence to Equilibrium Theorem applies to periodic chains, while the Ergodic Theorem applies to aperiodic chains.

b) The Convergence to Equilibrium Theorem concerns the state distribution, while the Ergodic Theorem concerns the frequency of state visits.

c) The Convergence to Equilibrium Theorem requires a diagonal transition matrix, while the Ergodic Theorem does not.

d) The Convergence to Equilibrium Theorem is applicable only to finite Markov chains, while the Ergodic Theorem is not.

Answer for 1: c. Justification: The text states, “We check that a lazy chain is necessarily aperiodic.”

Answer for 2: b. Justification: A lazy Markov chain is defined as having all diagonal entries of the transition matrix strictly positive, i.e., \(P_{ii} > 0\) for all \(i \in S\).

Answer for 3: a. Justification: The text states, “First, the Convergence to Equilibrium Theorem applies to irreducible and aperiodic chains.”

Answer for 4: a. Justification: The text states, “Second, the Ergodic Theorem applies to irreducible (but not necessarily aperiodic) chains.”

Answer for 5: b. Justification: The Convergence to Equilibrium Theorem addresses the convergence of the state distribution to the stationary distribution, while the Ergodic Theorem focuses on the frequency of visits to any state converging to the stationary distribution.