More advanced material: spectral techniques for random walks on graphs; existence of a stationary distribution

\(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\S}{\mathcal{S}}\) \(\newcommand{\indep}{\perp\!\!\!\perp}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\)

5.6. More advanced material: spectral techniques for random walks on graphs; existence of a stationary distribution#

We cover a few advanced topics.

5.6.1. Spectral techniques for random walks on graphs#

In this section, we use techniques from spectral graph theory to analyze random walks on graphs.

Applying the spectral theorem via the normalized Laplacian We have seen how to compute the unique stationary distribution \(\bpi\) of random walk on a connected weighted (undirected) graph. Recall that \(\bpi\) is a (left) eigenvector of \(P\) with eigenvalue \(1\) (i.e., \(\bpi P = \bpi\)). In general, however, the matrix \(P\) is not symmetric in this case (see the previous example), even though the adjacency matrix is. So we cannot apply the spectral theorem to get the rest of the eigenvectors, if they even exist. But, remarkably, a symmetric matrix with a closely related spectral decomposition is hiding in the background.

Recall that the normalized Laplacian of a weighted graph \(G = (V,E,w)\) with adjacency matrix \(A\) and degree matrix \(D\) is defined as

\[ \mathcal{L} = I - D^{-1/2} A D^{-1/2}. \]

Recall that in the weighted case, the degree is defined as \(\delta(i) = \sum_{j:\{i,j\} \in E} w_{i,j}\). Because it is symmetric and positive semi-definite, we can write

\[ \mathcal{L} = \sum_{i=1}^n \eta_i \mathbf{z}_i \mathbf{z}_i^T, \]

where the \(\mathbf{z}_i\)s are orthonormal eigenvectors of \(\mathcal{L}\) and the eigenvalues satisfy \(0 \leq \eta_1 \leq \eta_2 \leq \cdots \leq \eta_n\).

Moreover, \(D^{1/2} \mathbf{1}\) is an eigenvector of \(\mathcal{L}\) with eigenvalue \(0\). So \(\eta_1 = 0\) and we set

\[ (\mathbf{z}_1)_i = \left(\frac{D^{1/2} \mathbf{1}}{\|D^{1/2} \mathbf{1}\|_2}\right)_i = \sqrt{\frac{\delta(i)}{\sum_{j\in V} \delta(j)}}, \quad \forall i \in [n], \]

which makes \(\mathbf{z}_1\) into a unit norm vector.
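
As a quick sanity check, here is a minimal NumPy sketch (the small weighted graph below is a hypothetical example, not one from the text): it builds \(\mathcal{L}\), confirms that its smallest eigenvalue is \(0\), and checks that the corresponding eigenvector is proportional to \(D^{1/2}\mathbf{1}\).

```python
import numpy as np

# A small weighted graph (hypothetical example): weighted adjacency matrix and degrees
A = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
deg = A.sum(axis=1)                       # weighted degrees delta(i)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

# Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt

# L is symmetric PSD, so eigh returns real eigenvalues in nondecreasing order
eta, Z = np.linalg.eigh(L)
print(np.round(eta, 6))                   # eta_1 is (numerically) 0

# The first eigenvector matches D^{1/2} 1 normalized to unit length (up to sign)
z1 = np.sqrt(deg) / np.linalg.norm(np.sqrt(deg))
print(np.allclose(np.abs(Z[:, 0]), np.abs(z1)))
```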

We return to the eigenvectors of \(P\). When a matrix \(A \in \mathbb{R}^{n \times n}\) is diagonalizable, it has an eigendecomposition of the form

\[ A = Q \Lambda Q^{-1}, \]

where \(\Lambda\) is a diagonal matrix whose diagonal entries are the eigenvalues of \(A\). The columns of \(Q\) are the eigenvectors of \(A\) and they form a basis of \(\mathbb{R}^n\). Unlike the symmetric case, however, the eigenvectors need not be orthogonal.

THEOREM (Eigendecomposition of Random Walk on a Graph) Let \(G = (V,E,w)\) be a graph with positive edge weights and no isolated vertex, and with degree matrix \(D\). Let \(P \in \mathbb{R}^{n \times n}\) be the transition matrix of random walk on \(G\). Let \(\mathbf{z}_1,\ldots,\mathbf{z}_n \in \mathbb{R}^n\) and \(0 \leq \eta_1 \leq \cdots \leq \eta_n \leq 2\) be the eigenvectors and eigenvalues of the normalized Laplacian. Then \(P\) has the following eigendecomposition

\[\begin{align*} P &= (D^{-1/2} Z)(I - H) (D^{-1/2} Z)^{-1} \end{align*}\]

where the columns of \(Z\) are the \(\mathbf{z}_i\)s and \(H\) is a diagonal matrix with the \(\eta_i\)s on its diagonal. This can also be written as

\[\begin{align*} P &= \mathbf{1}\bpi + \sum_{i=2}^n \lambda_i D^{-1/2} \mathbf{z}_i \mathbf{z}_i^T D^{1/2}, \end{align*}\]

where

\[ \pi_i = \frac{\delta(i)}{\sum_{j\in V} \delta(j)} \quad \text{and} \quad \lambda_i = 1- \eta_i, \qquad i =1,\ldots,n. \]

\(\sharp\)

Proof: We write \(\mathcal{L}\) in terms of \(P\). Recall that \(P = D^{-1} A\). Hence

\[ \mathcal{L} = I - D^{1/2} P D^{-1/2}. \]

Rearranging this becomes

\[ P = I - D^{-1/2} \mathcal{L} D^{1/2}. \]

Hence for all \(i=1,\ldots,n\)

\[\begin{align*} P (D^{-1/2} \mathbf{z}_i) &= (I - D^{-1/2} \mathcal{L} D^{1/2})\,(D^{-1/2} \mathbf{z}_i)\\ &= (D^{-1/2} \mathbf{z}_i) - D^{-1/2} \mathcal{L} D^{1/2} (D^{-1/2} \mathbf{z}_i)\\ &= (D^{-1/2} \mathbf{z}_i) - D^{-1/2} \mathcal{L} \mathbf{z}_i\\ &= (D^{-1/2} \mathbf{z}_i) - D^{-1/2} \eta_i \mathbf{z}_i\\ &= (1 - \eta_i) (D^{-1/2} \mathbf{z}_i). \end{align*}\]

Because \(P\) is a transition matrix, all its eigenvalues are bounded in absolute value by \(1\). So \(|1-\eta_i|\leq 1\), which implies \(0 \leq \eta_i \leq 2\).

We also note that

\[ (D^{-1/2} Z) (D^{1/2} Z)^T = D^{-1/2} Z Z^T D^{1/2} = D^{-1/2} D^{1/2} = I, \]

by the orthonormality of the eigenvectors of the normalized Laplacian. So the columns of \(D^{-1/2} Z\), i.e., \(D^{-1/2} \mathbf{z}_i\) for \(i=1,\ldots,n\), are linearly independent and

\[ (D^{-1/2} Z)^{-1} = (D^{1/2} Z)^T. \]

That gives the first claim.

To get the second claim, we first note that (for similar calculations, see the definition of an SVD)

\[\begin{align*} P &= (D^{-1/2} Z)(I - H) (D^{-1/2} Z)^{-1}\\ &= (D^{-1/2} Z)(I - H) (D^{1/2} Z)^T\\ &= D^{-1/2} Z (I - H) Z^T D^{1/2}\\ &= \sum_{i=1}^n \lambda_i D^{-1/2} \mathbf{z}_i \mathbf{z}_i^T D^{1/2}, \end{align*}\]

where \(\lambda_i = 1- \eta_i\).

We then use the expression for \(\mathbf{z}_1\) above. We have

\[ D^{-1/2} \mathbf{z}_1 = D^{-1/2} \frac{D^{1/2}\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2} = \frac{\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}, \]

while

\[ \mathbf{z}_1^T D^{1/2} = (D^{1/2} \mathbf{z}_1)^T = \left(D^{1/2} \frac{D^{1/2}\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)^T = \left(\frac{D \mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)^T. \]

So

\[\begin{align*} D^{-1/2} \mathbf{z}_1 \mathbf{z}_1^T D^{1/2} &= \frac{\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2} \left(\frac{D \mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)^T\\ &= \mathbf{1} \left(\frac{D \mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2^2}\right)^T\\ &= \mathbf{1} \left(\frac{D \mathbf{1}}{\|D \mathbf{1}\|_1}\right)^T\\ &= \mathbf{1} \bpi. \end{align*}\]

That proves the second claim. \(\square\)

If \(G\) is connected and \(w_{i,i} > 0\) for all \(i\), then the chain is irreducible and weakly lazy. In that case, \(1\) is an eigenvalue of \(P\) with geometric multiplicity one and \(-1\) is not an eigenvalue, so we must have \(0 < \eta_2 \leq \cdots \leq \eta_n < 2\).
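
Here is a minimal NumPy sketch of the theorem on a small weighted graph with self-loops (a hypothetical example): it reassembles \(P\) from \(\mathbf{1}\bpi\) and the terms \(\lambda_i D^{-1/2}\mathbf{z}_i \mathbf{z}_i^T D^{1/2}\) for \(i \geq 2\).

```python
import numpy as np

# Small weighted graph with self-loops (hypothetical example), so the walk is weakly lazy
A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 1.]])
deg = A.sum(axis=1)
P = A / deg[:, None]                       # transition matrix P = D^{-1} A

# Normalized Laplacian and its spectral decomposition
D_sqrt, D_inv_sqrt = np.diag(np.sqrt(deg)), np.diag(1 / np.sqrt(deg))
eta, Z = np.linalg.eigh(np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt)
lam = 1 - eta                              # eigenvalues of P

# Stationary distribution pi_i = delta(i) / sum_j delta(j)
pi = deg / deg.sum()

# Reassemble P from the theorem: P = 1 pi + sum_{i>=2} lam_i D^{-1/2} z_i z_i^T D^{1/2}
P_rebuilt = np.outer(np.ones(3), pi)
for i in range(1, 3):                      # skip i = 0, which corresponds to eta_1 = 0
    zi = Z[:, i]
    P_rebuilt += lam[i] * (D_inv_sqrt @ np.outer(zi, zi) @ D_sqrt)
print(np.allclose(P, P_rebuilt))           # True (up to floating point)
```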

Limit theorems revisited The Convergence to Equilibrium Theorem implies that in the irreducible, aperiodic case

\[ \bmu P^t \to \bpi, \]

as \(t \to +\infty\), for any initial distribution \(\bmu\) and the unique stationary distribution \(\bpi\). Here we give a simpler proof for random walk on a graph (or more generally a reversible chain), with the added bonus of a convergence rate. This follows from the same argument we used in the Power Iteration Lemma.

DEFINITION (Spectral Gap) Let \(G = (V,E,w)\) be a graph with positive edge weights and no isolated vertex. Let \(P \in \mathbb{R}^{n \times n}\) be the transition matrix of random walk on \(G\). The absolute spectral gap of \(G\) is defined as \(\gamma_{\star} = 1 - \lambda_{\star}\) where

\[ \lambda_{\star} = \max\{|\lambda|\,:\, \text{$\lambda$ is an eigenvalue of $P$, $\lambda \neq 1$} \}. \]

\(\natural\)
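
In practice, \(\gamma_\star\) can be computed directly from the eigenvalues of \(P\) (or, equivalently, from the \(\eta_i\) via \(\lambda_i = 1 - \eta_i\)). Here is a minimal sketch, using the same kind of small hypothetical example as before.

```python
import numpy as np

def absolute_spectral_gap(P, tol=1e-10):
    """Absolute spectral gap gamma_* = 1 - max{|lambda| : lambda eigenvalue of P, lambda != 1}."""
    evals = np.linalg.eigvals(P)
    lam_star = max(abs(l) for l in evals if abs(l - 1) > tol)
    return 1 - lam_star

# Lazy walk on a small weighted graph (hypothetical numbers)
A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 1.]])
P = A / A.sum(axis=1, keepdims=True)
print(absolute_spectral_gap(P))
```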

THEOREM (Convergence to Equilibrium: Reversible Case) Let \(G = (V,E,w)\) be a connected graph with positive edge weights and \(w_{x,x} > 0\) for all \(x \in V\). Let \(P \in \mathbb{R}^{n \times n}\) be the transition matrix of random walk on \(G\) and \(\bpi\) its unique stationary distribution. Then

\[ \bmu P^t \to \bpi, \]

as \(t \to +\infty\) for any initial distribution \(\bmu\). Moreover,

\[ \left|P^t_{x,y} - \pi_y\right| \leq (1-\gamma_\star)^t \sqrt{\frac{\bar{\delta}}{\underline{\delta}}}, \]

where \(\bar{\delta} = \max_x \delta(x)\), \(\underline{\delta} = \min_x \delta(x)\) and \(\gamma_\star\) is the absolute spectral gap. \(\sharp\)

Proof: Similarly to the Power Iteration Lemma, using the Eigendecomposition of Random Walk on a Graph we get

\[\begin{align*} P^2 &= (D^{-1/2} Z)(I - H) (D^{-1/2} Z)^{-1}(D^{-1/2} Z)(I - H) (D^{-1/2} Z)^{-1}\\ &= (D^{-1/2} Z)(I - H)^2 (D^{-1/2} Z)^{-1}. \end{align*}\]

By induction,

\[\begin{align*} P^t &= (D^{-1/2} Z)(I - H)^t (D^{-1/2} Z)^{-1}\\ &= \sum_{i=1}^n \lambda_i^t (D^{-1/2} \mathbf{z}_i) (D^{1/2} \mathbf{z}_i)^T\\ &= \mathbf{1} \bpi + \sum_{i=2}^n \lambda_i^t (D^{-1/2} \mathbf{z}_i) (D^{1/2} \mathbf{z}_i)^T, \end{align*}\]

by calculations similar to the proof of the Eigendecomposition of Random Walk on a Graph.

In the irreducible, weakly lazy case, for \(i=2,\ldots,n\), \(\lambda_i^t \to 0\) as \(t \to +\infty\).

Moreover, \(|\lambda_i| \leq (1-\gamma_\star)\) for all \(i=2,\ldots,n\). Hence,

\[\begin{align*} \left|P^t_{x,y} - \pi_y\right| &= \left|\sum_{i=2}^n \lambda_i^t \delta(x)^{-1/2}(\mathbf{z}_i)_x (\mathbf{z}_i)_y \delta(y)^{1/2}\right|\\ &\leq (1-\gamma_\star)^t \sqrt{\frac{\delta(y)}{\delta(x)}} \sum_{i=2}^n |(\mathbf{z}_i)_x (\mathbf{z}_i)_y|. \end{align*}\]

We then use Cauchy-Schwarz and the fact that \(Z Z^T = I\) (as \(Z\) is an orthogonal matrix), which implies \(\sum_{i=1}^n (\mathbf{z}_i)_x^2 = 1\).

We get that the above is

\[\begin{align*} &\leq (1-\gamma_\star)^t \sqrt{\frac{\delta(y)}{\delta(x)}} \sqrt{\sum_{i=2}^n (\mathbf{z}_i)_x^2} \sqrt{\sum_{i=2}^n (\mathbf{z}_i)_y^2}\\ &\leq (1-\gamma_\star)^t \sqrt{\frac{\delta(y)}{\delta(x)}}\\ &\leq (1-\gamma_\star)^t \sqrt{\frac{\bar{\delta}}{\underline{\delta}}}. \end{align*}\]

\(\square\)
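
The following sketch compares \(\max_{x,y} |P^t_{x,y} - \pi_y|\) with the bound \((1-\gamma_\star)^t \sqrt{\bar{\delta}/\underline{\delta}}\) on the small lazy walk used in the earlier sketches (a hypothetical example).

```python
import numpy as np

# Lazy random walk on a small weighted graph (hypothetical example)
A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 1.]])
deg = A.sum(axis=1)
P = A / deg[:, None]
pi = deg / deg.sum()

# Absolute spectral gap via the normalized Laplacian: lambda_i = 1 - eta_i
D_inv_sqrt = np.diag(1 / np.sqrt(deg))
eta = np.linalg.eigvalsh(np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt)
lam = 1 - eta
lam_star = max(abs(l) for l in lam if not np.isclose(l, 1))
gamma_star = 1 - lam_star

bound_factor = np.sqrt(deg.max() / deg.min())
Pt = np.eye(3)
for t in range(1, 11):
    Pt = Pt @ P
    err = np.max(np.abs(Pt - pi))          # max_{x,y} |P^t_{x,y} - pi_y|
    print(t, err, (1 - gamma_star)**t * bound_factor)
```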

We record an immediate corollary that will be useful next. Let \(f : V \to \mathbb{R}\) be a function over the vertices. Define the (column) vector \(\mathbf{f} = (f(1),\ldots,f(n))^T\) and note that

\[ \bpi \mathbf{f} = \sum_{x \in V} \pi_x f(x). \]

It will be convenient to use the \(\ell_\infty\)-norm. For a vector \(\mathbf{x} = (x_1,\ldots,x_n)^T\), we let \(\|\mathbf{x}\|_\infty = \max_{i \in [n]} |x_i|\).

THEOREM For any initial distribution \(\bmu\) and any \(t\)

\[\begin{align*} \left|\,\E[f(X_t)] - \bpi \mathbf{f}\,\right| \leq (1-\gamma_\star)^t \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty. \end{align*}\]

\(\sharp\)

Proof: By the Time Marginals Theorem,

\[\begin{align*} \left|\,\E[f(X_t)] - \bpi \mathbf{f}\,\right| &= \left|\,\sum_{x} \sum_y \mu_x (P^t)_{x,y} f(y) - \sum_{y} \pi_y f(y)\,\right|. \end{align*}\]

Because \(\sum_{x} \mu_x = 1\), the right-hand side is

\[\begin{align*} &= \left|\,\sum_{x} \sum_y \mu_x (P^t)_{x,y} f(y) - \sum_x \sum_{y} \mu_x \pi_y f(y)\,\right|\\ &\leq \sum_{x} \mu_x \sum_y \left| (P^t)_{x,y} - \pi_y \right| |f(y)|, \end{align*}\]

by the triangle inequality.

Now by the theorem, combined with the observation that \(\sqrt{\delta(y)/\delta(x)} = \sqrt{\pi_y/\pi_x} \leq \pi_y/\pi_{\min{}}\) (which holds because \(\pi_{\min{}} \leq \sqrt{\pi_x \pi_y}\)), this is

\[\begin{align*} &\leq \sum_{x} \mu_x \sum_y (1-\gamma_\star)^t \frac{\pi_y}{\pi_{\min{}}}|f(y)|\\ &= (1-\gamma_\star)^t \frac{1}{\pi_{\min{}}} \sum_{x} \mu_x \sum_y \pi_y |f(y)|\\ &\leq (1-\gamma_\star)^t \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty. \end{align*}\]

That proves the claim. \(\square\)
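
A quick numerical illustration of the corollary, on the same hypothetical lazy walk and an arbitrary function \(f\): the left-hand side is computed exactly from \(\bmu P^t\) and printed next to the bound.

```python
import numpy as np

# Same lazy walk on a small weighted graph (hypothetical example)
A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 1.]])
deg = A.sum(axis=1)
P = A / deg[:, None]
pi = deg / deg.sum()
f = np.array([1.0, -2.0, 0.5])             # arbitrary function on the vertices
mu = np.array([1.0, 0.0, 0.0])             # start at vertex 1

# Absolute spectral gap from the normalized Laplacian eigenvalues
D_inv_sqrt = np.diag(1 / np.sqrt(deg))
eta = np.linalg.eigvalsh(np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt)
gamma_star = 1 - max(abs(1 - e) for e in eta if not np.isclose(e, 0))

mu_t = mu.copy()
for t in range(1, 11):
    mu_t = mu_t @ P                         # distribution of X_t
    lhs = abs(mu_t @ f - pi @ f)            # |E[f(X_t)] - pi f|
    rhs = (1 - gamma_star)**t * np.max(np.abs(f)) / pi.min()
    print(t, lhs, rhs)
```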

We also prove a version of the Ergodic Theorem.

THEOREM (Ergodic Theorem: Reversible Case) Let \(G = (V,E,w)\) be a connected graph with positive edge weights and \(w_{x,x} > 0\) for all \(x \in V\). Let \(P \in \mathbb{R}^{n \times n}\) be the transition matrix of random walk on \(G\) and \(\bpi\) its unique stationary distribution. Let \(f : V \to \mathbb{R}\) be a function over the vertices. Then for any initial distribution \(\bmu\)

\[ \frac{1}{T} \sum_{t=1}^T f(X_{t}) \to \sum_{x \in V} \pi_x f(x), \]

in probability as \(T \to +\infty\). \(\sharp\)

Proof idea: We use Chebyshev’s Inequality. By the Convergence Theorem: Reversible Case, the expectation converges to the limit. To bound the variance, we use the Eigendecomposition of Random Walk on a Graph.

Proof: We use Chebyshev’s Inequality, similarly to our proof of the Weak Law of Large Numbers. However, unlike that case, here the terms in the sum are not independent and require some finessing. Define again the (column) vector \(\mathbf{f} = (f(1),\ldots,f(n))^T\). Then the limit can be written as

\[ \sum_{x \in V} \pi_x f(x) = \bpi \mathbf{f}. \]

By the corollary, the expectation of the sum can be bounded as follows

\[\begin{align*} \left|\,\E\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right] - \bpi \mathbf{f}\,\right| &\leq \frac{1}{T} \sum_{t=1}^T\left|\E[f(X_{t})] - \bpi \mathbf{f}\right|\\ &\leq \frac{1}{T} \sum_{t=1}^T (1-\gamma_\star)^t \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty\\ &\leq \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty \frac{1}{T} \sum_{t=0}^{+\infty} (1-\gamma_\star)^t\\ &= \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty \gamma_\star^{-1}\frac{1}{T} \to 0 \end{align*}\]

as \(T \to +\infty\).

Next we bound the variance of the sum. By the Variance of a Sum,

\[ \mathrm{Var}\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right] = \frac{1}{T^2} \sum_{t=1}^T \mathrm{Var}[f(X_{t})] + \frac{2}{T^2} \sum_{1 \leq s < t\leq T} \mathrm{Cov}[f(X_{s}),f(X_{t})]. \]

We bound the variance and covariance separately using the Eigendecomposition of Random Walk on a Graph.

To obtain convergence, a trivial bound on the variance suffices. Then

\[ 0 \leq \mathrm{Var}[f(X_{t})] \leq \E[f(X_{t})^2] \leq \|\mathbf{f}\|_\infty^2. \]

Hence,

\[ 0 \leq \frac{1}{T^2} \sum_{t=1}^T \mathrm{Var}[f(X_{t})] \leq \frac{T \|\mathbf{f}\|_\infty^2}{T^2} \to 0, \]

as \(T \to +\infty\).

Bounding the covariance requires a more delicate argument. Fix \(1 \leq s < t\leq T\). The trick is to condition on \(X_s\) and use the Markov Property. By definition of the covariance and the Law of Total Expectation,

\[\begin{align*} &\mathrm{Cov}[f(X_{s}),f(X_{t})]\\ &= \E\left[(f(X_{s}) - \E[f(X_{s})]) (f(X_{t}) - \E[f(X_{t})])\right]\\ &= \sum_{x} \E\left[(f(X_{s}) - \E[f(X_{s})]) (f(X_{t}) - \E[f(X_{t})])\,\middle|\,X_s = x\right]\,\P[X_s = x]\\ &= \sum_{x} \E\left[f(X_{t}) - \E[f(X_{t})]\,\middle|\,X_s = x\right](f(x) - \E[f(X_{s})]) \,\P[X_s = x]. \end{align*}\]

We now use the Markov property and the time-homogeneity of the chain to note that

\[\begin{align*} &\E\left[f(X_{t}) - \E[f(X_{t})]\,\middle|\,X_s = x\right]\\ &= \E\left[f(X_{t})\,\middle|\,X_s = x\right] - \E[f(X_{t})]\\ &= \E\left[f(X_{t-s})\,\middle|\,X_0 = x\right] - \E[f(X_{t})]. \end{align*}\]

We now use the corollary. This expression in absolute value is

\[\begin{align*} &\left|\E\left[f(X_{t}) - \E[f(X_{t})]\,\middle|\,X_s = x\right]\right|\\ &= \left|\E\left[f(X_{t-s})\,\middle|\,X_0 = x\right] - \E[f(X_{t})]\right|\\ &= \left|(\E\left[f(X_{t-s})\,\middle|\,X_0 = x\right] - \bpi \mathbf{f}) - (\E[f(X_{t})] - \bpi \mathbf{f})\right|\\ &\leq \left|\E\left[f(X_{t-s})\,\middle|\,X_0 = x\right] - \bpi \mathbf{f}\right| + \left|\E[f(X_{t})] - \bpi \mathbf{f}\right|\\ &\leq (1-\gamma_\star)^{t-s} \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty + (1-\gamma_\star)^t \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty. \end{align*}\]

Plugging back above,

\[\begin{align*} &\left|\mathrm{Cov}[f(X_{s}),f(X_{t})]\right|\\ &\leq \sum_{x} \left|\E\left[f(X_{t}) - \E[f(X_{t})]\,\middle|\,X_s = x\right]\right| \left|f(x) - \E[f(X_{s})]\right| \,\P[X_s = x]\\ &\leq \sum_{x} ((1-\gamma_\star)^{t-s} \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty + (1-\gamma_\star)^t \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty) \left|f(x) - \E[f(X_{s})]\right| \,\P[X_s = x]\\ &\leq ((1-\gamma_\star)^{t-s} \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty + (1-\gamma_\star)^t \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty) \sum_{x} 2 \|\mathbf{f}\|_\infty\P[X_s = x]\\ &\leq 4 (1-\gamma_\star)^{t-s} \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2, \end{align*}\]

where we used that \((1-\gamma_\star)^t \leq (1-\gamma_\star)^{t-s}\) since \((1-\gamma_\star) < 1\) and \(t -s \leq t\).

Returning to the sum over the covariances, the previous bound gives

\[\begin{align*} &\left|\frac{2}{T^2} \sum_{1 \leq s < t\leq T} \mathrm{Cov}[f(X_{s}),f(X_{t})]\right|\\ &\leq \frac{2}{T^2} \sum_{1 \leq s < t\leq T} \left|\mathrm{Cov}[f(X_{s}),f(X_{t})]\right|\\ &\leq \frac{2}{T^2} \sum_{1 \leq s < t\leq T} 4 (1-\gamma_\star)^{t-s} \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2. \end{align*}\]

To evaluate the sum we make the change of variable \(h = t - s\) to get that the previous expression is

\[\begin{align*} &\leq 4 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2\frac{2}{T^2} \sum_{1 \leq s \leq T} \sum_{h=1}^{T-s} (1-\gamma_\star)^{h}\\ &\leq 4 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2\frac{2}{T^2} \sum_{1 \leq s \leq T} \sum_{h=0}^{+\infty} (1-\gamma_\star)^{h}\\ &= 4 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2\frac{2}{T^2} \sum_{1 \leq s \leq T} \frac{1}{\gamma_\star}\\ &= 8 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2 \gamma_\star^{-1} \frac{1}{T} \to 0, \end{align*}\]

as \(T \to +\infty\).

We have shown that

\[\begin{align*} \mathrm{Var}\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right] \leq \|\mathbf{f}\|_\infty^2 \frac{1}{T} + 8 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2 \gamma_\star^{-1} \frac{1}{T} \leq 9 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2 \gamma_\star^{-1} \frac{1}{T}. \end{align*}\]

For any \(\varepsilon > 0\)

\[\begin{align*} &\P\left[\left|\,\frac{1}{T} \sum_{t=1}^T f(X_{t}) - \bpi \mathbf{f}\,\right| \geq \varepsilon \right]\\ &= \P\left[\left|\,\frac{1}{T} \sum_{t=1}^T f(X_{t}) - \E\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right] + \left(\E\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right] - \bpi \mathbf{f} \right)\,\right| \geq \varepsilon \right]\\ &\leq \P\left[\left|\,\frac{1}{T} \sum_{t=1}^T f(X_{t}) - \E\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right]\,\right| + \left|\,\E\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right] - \bpi \mathbf{f} \,\right| \geq \varepsilon \right]\\ &\leq \P\left[\left|\,\frac{1}{T} \sum_{t=1}^T f(X_{t}) - \E\left[\frac{1}{T} \sum_{t=1}^T f(X_{t})\right]\,\right| \geq \varepsilon - \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty \gamma_\star^{-1}\frac{1}{T}\right]. \end{align*}\]

We can now apply Chebyshev (for \(T\) large enough that \(\varepsilon - \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty \gamma_\star^{-1}\frac{1}{T} > 0\)) to get

\[\begin{align*} \P\left[\left|\,\frac{1}{T} \sum_{t=1}^T f(X_{t}) - \bpi \mathbf{f}\,\right| \geq \varepsilon \right] &\leq \frac{9 \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty^2 \gamma_\star^{-1} \frac{1}{T}}{(\varepsilon - \pi_{\min{}}^{-1} \|\mathbf{f}\|_\infty \gamma_\star^{-1}\frac{1}{T})^2} \to 0, \end{align*}\]

as \(T \to +\infty\). \(\square\)
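
The statement can also be checked by simulation. The sketch below (same hypothetical lazy walk as before) runs the chain for a long time and compares the running average of \(f(X_t)\) to \(\bpi \mathbf{f}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Lazy random walk on a small weighted graph (hypothetical example)
A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 1.]])
deg = A.sum(axis=1)
P = A / deg[:, None]
pi = deg / deg.sum()
f = np.array([1.0, -2.0, 0.5])

# Simulate the chain and compare the running average of f(X_t) to pi f
T = 100_000
x = 0                                       # start at vertex 1 (index 0)
total = 0.0
for t in range(1, T + 1):
    x = rng.choice(3, p=P[x])               # one step of the walk
    total += f[x]
print(total / T, pi @ f)                    # the two numbers should be close
```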

5.6.2. Additional proofs#

We give a few proofs that were omitted in the chapter.

Proof of existence theorem We prove the existence of a stationary distribution.

Proof of Existence of Stationary Distribution Theorem: The proof is not straightforward. We make a series of claims to establish existence and uniqueness.

LEMMA (Step 1) There is a non-zero row vector \(\mathbf{z} \in \mathbb{R}^n\) such that \(\mathbf{z} P = \mathbf{z}\). \(\flat\)

LEMMA (Step 2) Let \(\mathbf{z} \in \mathbb{R}^n\) be a non-zero row vector such that \(\mathbf{z} P = \mathbf{z}\). Then

\[ \bpi = \frac{\mathbf{z}}{\sum_{x} z_x} \]

is a strictly positive stationary distribution of \(P\). \(\flat\)

LEMMA (Step 3) Let \(\bpi_1\) and \(\bpi_2\) be stationary distributions of \(P\). Then \(\bpi_1 = \bpi_2\). \(\flat\)

Proof: (Lemma (Step 1)) Because \(P\) is stochastic, we have by definition that \(P \mathbf{1} = \mathbf{1}\). Put differently,

\[ (P - I) \mathbf{1} = \mathbf{0} \]

that is, the columns of \(P - I\) are linearly dependent. In particular \(\mathrm{rk}(P-I) < n\). That in turn implies that the rows of \(P - I\) are linearly dependent by the Row Rank Equals Column Rank Theorem. So there exists a non-zero row vector \(\mathbf{z} \in \mathbb{R}^n\) such that \(\mathbf{z}(P-I) = \mathbf{0}\), or after rearranging

\[ \mathbf{z}P = \mathbf{z}. \]

That proves the claim. \(\square\)

Proof: (Lemma (Step 2)) We break up the proof into several claims.

To take advantage of irreducibility, we first construct a positive stochastic matrix with \(\mathbf{z}\) as a left eigenvector of eigenvalue \(1\). We then show that all entries of \(\mathbf{z}\) have the same sign. Finally, we normalize \(\mathbf{z}\).

LEMMA (Step 2a) There exists a non-negative integer \(h\) such that

\[ R = \frac{1}{h+1}[I + P + P^2 + \cdots + P^h] \]

has only strictly positive entries and satisfies \(\mathbf{z} R = \mathbf{z}\). \(\flat\)

LEMMA (Step 2b) The entries of \(\mathbf{z}\) are either all non-negative or all non-positive. \(\flat\)

LEMMA (Step 2c) Let \(\bpi = \frac{\mathbf{z}}{\mathbf{z}\mathbf{1}}\). Then \(\bpi\) is a strictly positive stationary distribution. \(\flat\)

We prove the claims next.

Proof: (Lemma (Step 2a)) By irreducibility and the Communication Lemma, for any \(x, y \in [n]\) there is \(h_{xy}\) such that \((P^{h_{xy}})_{x,y} > 0\). Now define

\[ h = \max_{x,y \in [n]} h_{xy}. \]

It can be shown (Try it!) that \(P^s\) (as a product of stochastic matrices) is itself a stochastic matrix for all \(s\). In particular, it has nonnegative entries. Hence, for each \(x,y\),

\[ R_{x,y} = \frac{1}{h+1}[I_{x,y} + P_{x,y} + (P^2)_{x,y} + \cdots + (P^h)_{x,y}] \geq \frac{1}{h+1} (P^{h_{xy}})_{x,y} > 0. \]

It can be shown (Try it!) that \(R\) (as a convex combination of stochastic matrices) is itself a stochastic matrix.

Moreover, by the Stationarity Lemma, since \(\mathbf{z} P = \mathbf{z}\) it follows that \(\mathbf{z} P^s = \mathbf{z}\) for all \(s\). Therefore,

\[ \mathbf{z} R = \frac{1}{h+1}[\mathbf{z}I + \mathbf{z}P + \mathbf{z}P^2 + \cdots + \mathbf{z}P^h] = \frac{1}{h+1}[\mathbf{z} + \mathbf{z} + \mathbf{z} + \cdots + \mathbf{z}] = \mathbf{z}. \]

That concludes the proof. \(\square\)
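
Here is a small numerical illustration of the lemma, on a hypothetical irreducible (but not positive) transition matrix: averaging the first few powers produces a strictly positive stochastic matrix \(R\) that preserves any left eigenvector of eigenvalue \(1\).

```python
import numpy as np

# A small irreducible transition matrix (hypothetical example) that is NOT
# positive: for instance, state 1 cannot reach state 3 in one step.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# R = (1/(h+1)) [I + P + ... + P^h]; here h = 2 already makes R positive.
h = 2
R = sum(np.linalg.matrix_power(P, s) for s in range(h + 1)) / (h + 1)
print((R > 0).all())                        # True: all entries strictly positive

# Any row vector z with z P = z also satisfies z R = z.
evals, evecs = np.linalg.eig(P.T)           # columns of evecs: left eigenvectors of P
z = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
print(np.allclose(z @ P, z), np.allclose(z @ R, z))
```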

Proof: (Lemma (Step 2b)) We argue by contradiction. Suppose that two entries of \(\mathbf{z} = (z_x)_{x \in [n]}\) have different signs. Say \(z_i > 0\) while \(z_j < 0\). Let \(R = (r_{x,y})_{x,y=1}^n\). By the previous claim, \(r_{x,y} > 0\) for all \(x,y\), therefore

\[ |z_y| = \left|\sum_{x} z_x r_{x,y}\right| = \left|\sum_{x: z_x \geq 0} z_x r_{x,y} + \sum_{x: z_x < 0} z_x r_{x,y}\right|. \]

The first term on the right-hand side is strictly positive (since it is at least \(z_i r_{i,y} > 0\)) while the second term is strictly negative (since it is at most \(z_j r_{j,y} < 0\)). Hence, because of cancellations, the expression above is strictly smaller than the sum of the absolute values

\[ |z_y| < \sum_{x} |z_x| r_{x,y}. \]

Since \(R\) is stochastic by the previous claim, we deduce after summing over \(y\)

\[ \sum_{y} |z_y| < \sum_{y} \sum_{x} |z_x| r_{x,y} = \sum_{x} |z_x| \sum_{y} r_{x,y} = \sum_{x} |z_x|, \]

a contradiction, proving the claim. \(\square\)

Proof: (Lemma (Step 2c)) Now define \(\bpi = (\pi_x)_{x \in [n]}\) as

\[ \pi_ x = \frac{z_x}{\sum_{i} z_i} = \frac{|z_x|}{\sum_{i} |z_i|} \geq 0, \]

where the second equality comes from the previous claim. We also used the fact that \(\mathbf{z} \neq \mathbf{0}\). For all \(y\),

\[ \sum_{x} \pi_x p_{x,y} = \sum_{x} \frac{z_x}{\sum_{i} z_i} p_{x,y} = \frac{1}{\sum_{i} z_i} \sum_{x} z_x p_{x,y} = \frac{z_y}{\sum_{i} z_i} = \pi_y. \]

The same holds with \(p_{x,y}\) replaced with \(r_{x,y}\) by Lemma (Step 2a). Since \(r_{x,y} > 0\) and \(\mathbf{z} \neq \mathbf{0}\) it follows that \(\pi_y > 0\) for all \(y\). That proves the claim. \(\square\)

It remains to prove uniqueness.

Proof: (Lemma (Step 3)) Suppose there are two distinct stationary distributions \(\bpi_1\) and \(\bpi_2\). Since they are distinct and both sum to one, they are not a multiple of each other and therefore are linearly independent. Apply Gram-Schmidt:

\[ \mathbf{q}_1 = \frac{\bpi_1}{\|\bpi_1\|} \qquad \text{and} \qquad\mathbf{q}_2 = \frac{\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1}{\|\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1\|}. \]

Then

\[ \mathbf{q}_1 P = \frac{\bpi_1}{\|\bpi_1\|} P = \frac{\bpi_1 P}{\|\bpi_1\|} = \frac{\bpi_1}{\|\bpi_1\|} = \mathbf{q}_1 \]

and all entries of \(\mathbf{q}_1\) are strictly positive.

Similarly,

\[\begin{align*} \mathbf{q}_2 P &= \frac{\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1}{\|\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1\|} P\\ &= \frac{\bpi_2 P - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1 P}{\|\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1\|}\\ &= \frac{\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1}{\|\bpi_2 - \langle \bpi_2, \mathbf{q}_1 \rangle \mathbf{q}_1\|}\\ &= \mathbf{q}_2. \end{align*}\]

By the claims above, there is a multiple of \(\mathbf{q}_2\), say \(\mathbf{q}_2' = \alpha \mathbf{q}_2\) with \(\alpha \neq 0\), such that \(\mathbf{q}_2' P = \mathbf{q}_2'\) and all entries of \(\mathbf{q}_2'\) are strictly positive.

By the Gram-Schmidt procedure,

\[ \langle \mathbf{q}_1, \mathbf{q}_2' \rangle = \langle \mathbf{q}_1, \alpha \mathbf{q}_2 \rangle = \alpha \langle \mathbf{q}_1, \mathbf{q}_2 \rangle = 0. \]

But this is a contradiction: both vectors have strictly positive entries, so their inner product must be strictly positive. That concludes the proof. \(\square\)

A few more observations about the eigenvalues of \(P\).

(1) Whenever \(\lambda\) is a left eigenvalue of \(P\) (i.e. \(\mathbf{z} P = \lambda \mathbf{z}\) for some \(\mathbf{z} \in \mathbb{R}^n\) as a row vector), it is also a right eigenvalue of \(P\) (i.e. \(P \mathbf{y} = \lambda \mathbf{y}\) for some \(\mathbf{y} \in \mathbb{R}^n\)). One way to see this using our previous results is to note that \(\mathbf{z} P = \lambda \mathbf{z}\) is equivalent to \(P^T \mathbf{z}^T = \lambda \mathbf{z}^T\), or put differently \((P^T - \lambda I) \mathbf{z}^T = \mathbf{0}\), so that

\[ \mathbf{z}^T \in \mathrm{null}(P^T - \lambda I). \]

Similarly, \(P \mathbf{y} = \lambda \mathbf{y}\) is equivalent to

\[ \mathbf{y} \in \mathrm{null}(P - \lambda I). \]

By the Rank-Nullity Theorem, these two null spaces have the same dimension: indeed \((P - \lambda I)^T = P^T - \lambda I^T = P^T - \lambda I\), and a matrix and its transpose have the same rank, hence the same nullity. In particular, when one of them has dimension greater than \(0\) (i.e., it contains non-zero vectors), so does the other.

That is not to say that they are the same spaces! Only their dimensions match. In other words, the left and right eigenvalues are the same, but the left and right eigenvectors need not be.

(2) What we have shown in the previous theorem is that, if \(P\) is irreducible then it has a unique (up to scaling) left eigenvector of eigenvalue \(1\). By the first observation, \(\mathbf{1}\) is also the unique right eigenvector of \(P\) with eigenvalue \(1\) in this case. That is, the geometric multiplicity of \(1\) is \(1\).

(3) What about the other eigenvalues? Suppose that \(\mathbf{z} P = \lambda \mathbf{z}\) for a non-zero row vector \(\mathbf{z}\), then taking the \(\ell_1\)-norm of the left-hand side we get

\[\begin{align*} \|\mathbf{z} P\|_1 &= \sum_{j=1}^n \left|\sum_{i=1}^n z_i p_{i,j} \right|\\ &\leq \sum_{j=1}^n \sum_{i=1}^n |z_i| p_{i,j}\\ &\leq \sum_{i=1}^n |z_i| \sum_{j=1}^n p_{i,j}\\ &\leq \sum_{i=1}^n |z_i|\\ &= \|\mathbf{z}\|_1, \end{align*}\]

where we used that \(P\) is stochastic.

The \(\ell_1\)-norm of the right-hand side is

\[\begin{align*} \|\lambda \mathbf{z} \|_1 &= |\lambda| \|\mathbf{z} \|_1. \end{align*}\]

Hence \(|\lambda| \|\mathbf{z}\|_1 \leq \|\mathbf{z}\|_1\), which after dividing by \(\|\mathbf{z}\|_1 > 0\) implies

\[ |\lambda| \leq 1. \]
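
This bound is easy to check numerically; the sketch below draws a random stochastic matrix (rows sampled from a Dirichlet distribution, a hypothetical example) and verifies that all eigenvalues have modulus at most \(1\), with \(1\) attained.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random stochastic matrix: each row drawn from a Dirichlet distribution
n = 6
P = rng.dirichlet(np.ones(n), size=n)

evals = np.linalg.eigvals(P)
print(np.max(np.abs(evals)) <= 1 + 1e-12)   # all eigenvalues have modulus <= 1
print(np.isclose(np.abs(evals), 1).any())   # and modulus 1 is attained (by lambda = 1)
```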

(4) So all left and right eigenvalues of \(P\) are at most \(1\) in absolute value. In the irreducible case, we know that \(1\) is achieved and has geometric multiplicity \(1\). What about \(-1\)? Suppose \(\mathbf{z} P = - \mathbf{z}\) for a non-zero row vector \(\mathbf{z}\). Then applying \(P\) again to both sides we get

\[ \mathbf{z} P^2 = - \mathbf{z} P = \mathbf{z}. \]

So if \(P^2\) (which is stochastic; why?) is irreducible, then such a \(\mathbf{z}\) is unique up to scaling.

But we already know of one. Indeed, the unique stationary distribution \(\bpi\) of \(P\) satisfies

\[ \bpi P^2 = \bpi P = \bpi. \]

But it does not satisfy \(\bpi P = - \bpi\). Hence there is no eigenvector with eigenvalue \(-1\) in that case.

When is \(P^2\) irreducible? One such case is when \(P\) is irreducible and weakly lazy. Then, any directed path between two states in the transition graph of \(P\) can be turned into one of even length by inserting a self-loop if needed, and an even-length path in the transition graph of \(P\) corresponds to a directed path between the same states in the transition graph of \(P^2\).
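
A quick sanity check of this observation, using the standard criterion that a finite chain is irreducible if and only if \(\sum_{s=0}^{n-1} P^s\) has all positive entries (the lazy cycle below is a hypothetical example).

```python
import numpy as np

def is_irreducible(P):
    """A finite chain is irreducible iff sum_{s=0}^{n-1} P^s has all positive entries."""
    n = P.shape[0]
    S = sum(np.linalg.matrix_power(P, s) for s in range(n))
    return (S > 0).all()

# Irreducible, weakly lazy chain (hypothetical example): a lazy directed cycle on 4 states
n = 4
P = 0.5 * np.eye(n)
for i in range(n):
    P[i, (i + 1) % n] = 0.5

print(is_irreducible(P), is_irreducible(P @ P))   # both True
```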

Proof of the convergence theorem We prove the convergence to equilibrium theorem in the irreducible, weakly lazy case. Throughout this section, \((X_t)_{t \geq 0}\) is an irreducible, weakly lazy Markov chain on state space \(\mathcal{S} = [n]\) with transition matrix \(P = (p_{i,j})_{i,j=1}^n\), initial distribution \(\bmu = (\mu_1,\ldots,\mu_n)\) and unique stationary distribution \(\bpi = (\pi_1,\ldots,\pi_n)\).

We give a probabilistic proof of the Convergence to Equilibrium Theorem. The proof uses a clever idea: coupling. Separately from \((X_t)_{t \geq 0}\), we consider an independent Markov chain \((Y_t)_{t \geq 0}\) with the same transition matrix but initial distribution \(\bpi\). By the definition of stationarity,

\[ \P[Y_t = i] = \pi_i, \]

for all \(i\) and all \(t\). Hence it suffices to show that

\[ |\P[X_t = i] - \pi_i| = |\P[X_t = i] - \P[Y_t = i]| \to 0 \]

as \(t \to +\infty\) for all \(i\).

Step 1: Showing the joint process is Markov. We observe first that the joint process \((X_0,Y_0),(X_1,Y_1),\ldots\) is itself a Markov chain! Let’s just check the definition. By the independence of \((X_t)\) and \((Y_t)\),

\[\begin{align*} &\P[(X_t,Y_t) = (x_t,y_t)\,|\,(X_{t-1},Y_{t-1}) = (x_{t-1},y_{t-1}),\ldots,(X_0,Y_0) = (x_0,y_0)]\\ &=\frac{\P[(X_t,Y_t) = (x_t,y_t),(X_{t-1},Y_{t-1}) = (x_{t-1},y_{t-1}),\ldots,(X_0,Y_0) = (x_0,y_0)]} {\P[(X_{t-1},Y_{t-1}) = (x_{t-1},y_{t-1}),\ldots,(X_0,Y_0) = (x_0,y_0)]}\\ &=\frac{\P[X_t = x_t,X_{t-1} = x_{t-1},\ldots,X_0 = x_0]\,\P[Y_t = y_t,Y_{t-1} = y_{t-1},\ldots,Y_0 = y_0]} {\P[X_{t-1} = x_{t-1},\ldots,X_0 = x_0]\,\P[Y_{t-1} = y_{t-1},\ldots,Y_0 = y_0]}\\ &=\frac{\P[X_t = x_t,X_{t-1} = x_{t-1},\ldots,X_0 = x_0]} {\P[X_{t-1} = x_{t-1},\ldots,X_0 = x_0]} \frac{\P[Y_t = y_t,Y_{t-1} = y_{t-1},\ldots,Y_0 = y_0]} {\P[Y_{t-1} = y_{t-1},\ldots,Y_0 = y_0]}\\ &=\P[X_t = x_t\,|\,X_{t-1} = x_{t-1}, \ldots,X_0 = x_0] \,\P[Y_t = y_t\,|\,Y_{t-1} = y_{t-1}, \ldots,Y_0 = y_0]. \end{align*}\]

Now we use the fact that each is separately a Markov chain to simplify this last expression

\[\begin{align*} &= \P[X_t = x_t\,|\,X_{t-1} = x_{t-1}] \,\P[Y_t = y_t\,|\,Y_{t-1} = y_{t-1}]\\ &= \frac{\P[X_t = x_t,X_{t-1} = x_{t-1}]}{\P[X_{t-1} = x_{t-1}]} \frac{\P[Y_t = y_t,Y_{t-1} = y_{t-1}]}{\P[Y_{t-1} = y_{t-1}]}\\ &= \frac{\P[X_t = x_t,X_{t-1} = x_{t-1}]\,\P[Y_t = y_t,Y_{t-1} = y_{t-1}]}{\P[X_{t-1} = x_{t-1}]\,\P[Y_{t-1} = y_{t-1}]}\\ &= \frac{\P[(X_t,Y_t) = (x_t,y_t),(X_{t-1},Y_{t-1}) = (x_{t-1},y_{t-1})]}{\P[(X_{t-1},Y_{t-1}) = (x_{t-1},y_{t-1})]}\\ &= \P[(X_t,Y_t) = (x_t,y_t)\,|\,(X_{t-1},Y_{t-1}) = (x_{t-1},y_{t-1})]. \end{align*}\]

That proves the claim.

Step 2: Waiting until the marginal processes meet. The idea of the coupling argument is to consider the first time \(T\) that \(X_T = Y_T\). Note that \(T\) is a random time. But it is a special kind of random time often referred to as a stopping time. That is, the event \(\{T=s\}\) only depends on the trajectory of the joint chain \((X_t,Y_t)\) up to time \(s\). Specifically,

\[ \{T=s\} = \left\{ ((X_0,Y_0),\ldots,(X_{s-1},Y_{s-1})) \in \mathcal{N}^2_{s-1}, X_s = Y_s \right\} \]

where

\[ \mathcal{N}^2_{s-1} = \{ ((x_0,y_0),\ldots,(x_{s-1},y_{s-1})) \in [\mathcal{S}\times\mathcal{S}]^{s}\,:\,x_i \neq y_i, \forall 0 \leq i \leq s-1 \}. \]

Here is a remarkable observation. The distributions of \(X_s\) and \(Y_s\) are the same after the coupling time \(T\).

LEMMA (Distribution after Coupling) For all \(s\) and all \(i\),

\[ \P[X_s = i, T \leq s] = \P[Y_s = i, T \leq s]. \]

\(\flat\)

Proof: We sum over the possible values of \(T\) and \(X_T=Y_T\), and use the multiplication rule

\[\begin{align*} &\P[X_s = i, T \leq s]\\ &= \sum_{j=1}^n \sum_{h=0}^s \P[X_s = i, T=h, X_h = j]\\ &= \sum_{j=1}^n \sum_{h=0}^s \P[X_s = i \,|\, T=h, X_h = j] \,\P[T=h, X_h = j]\\ &= \sum_{j=1}^n \sum_{h=0}^s \P[X_s = i \,|\, ((X_0,Y_0),\ldots,(X_{h-1},Y_{h-1})) \in \mathcal{N}^2_{h-1}, (X_h, Y_h) = (j,j)]\\ &\qquad\qquad\qquad\qquad \times \P[((X_0,Y_0),\ldots,(X_{h-1},Y_{h-1})) \in \mathcal{N}^2_{h-1}, (X_h, Y_h) = (j,j)]. \end{align*}\]

Using the Markov property for the joint process, in the form stated in Exercise 3.36, we get that this last expression is

\[\begin{align*} &= \sum_{j=1}^n \sum_{h=0}^s \P[X_s = i \,|\, (X_h, Y_h) = (j,j)]\\ &\qquad\qquad\times \P[((X_0,Y_0),\ldots,(X_{h-1},Y_{h-1})) \in \mathcal{N}^2_{h-1}, (X_h, Y_h) = (j,j)]. \end{align*}\]

By the independence of the marginal processes and the fact that they have the same transition matrix, this is

\[\begin{align*} &= \sum_{j=1}^n \sum_{h=0}^s \P[X_s = i \,|\, X_h = j]\\ &\qquad\qquad\times \P[((X_0,Y_0),\ldots,(X_{h-1},Y_{h-1})) \in \mathcal{N}^2_{h-1}, (X_h, Y_h) = (j,j)]\\ &= \sum_{j=1}^n \sum_{h=0}^s \P[Y_s = i \,|\, Y_h = j]\\ &\qquad\qquad\times \P[((X_0,Y_0),\ldots,(X_{h-1},Y_{h-1})) \in \mathcal{N}^2_{h-1}, (X_h, Y_h) = (j,j)]. \end{align*}\]

Arguing backwards gives \(\P[Y_s = i, T \leq s]\) and concludes the proof. \(\square\)

Step 3: Bounding how long it takes for the marginal processes to meet. Since

\[ \P[X_t = i] = \P[X_t = i, T \leq t] + \P[X_t = i, T > t] \]

and similarly for \(\P[Y_t = i]\), using the previous lemma we get

\[\begin{align*} &|\P[X_t = i] - \P[Y_t = i]|\\ &= |\P[X_t = i, T > t] - \P[Y_t = i, T > t]|\\ &\leq \P[X_t = i, T > t] + \P[Y_t = i, T > t]\\ &\leq 2 \P[T > t]. \end{align*}\]

So it remains to show that \(\P[T > t]\) goes to \(0\) as \(t \to +\infty\).

We note that \(\P[T > t]\) is non-increasing as a function of \(t\). Indeed, for \(h > 0\), we have the inclusion of events

\[ \{T > t+h\} \subseteq \{T > t\}, \]

so by the monotonicity of probabilities

\[ \P[T > t+h] \leq \P[T > t]. \]

So it remains to prove the following lemma.

LEMMA (Tail of Coupling Time) There is a \(0 < \beta < 1\) and a positive integer \(m\) such that, for all positive integers \(k\),

\[ \P[T > k m] \leq \beta^{k}. \]

\(\flat\)

Proof: Recall that the state space of the marginal processes is \([n]\), so the joint process lives in \([n]\times [n]\). Since the event \(\{(X_m, Y_m) = (1,1)\}\) implies \(\{T \leq m\}\), to bound \(\P[T > m]\) we note that

\[ \P[T > m] = 1 - \P[T \leq m] \leq 1 - \P[(X_m, Y_m) = (1,1)]. \]

To bound the probability on right-hand side, we construct a path of length \(m\) in the transition graph of the joint process from any state to \((1,1)\).

For \(i \in [n]\), let \(\mathcal{P}_i = (z_{i,0},\ldots,z_{i,\ell_i})\) be a shortest path in the transition graph of \((X_t)_{t \geq 0}\) from \(i\) to \(1\), where \(z_{i,0} = i\) and \(z_{i,\ell_i} = 1\). By irreducibility there exists such a path for any \(i\). Here \(\ell_i\) is the length of \(\mathcal{P}_i\), and we define

\[ \ell^* = \max_{i \neq 1} \ell_i. \]

We make all the paths above the same length \(\ell^*\) by padding them with \(1\)s. That is, we define \(\mathcal{P}_i^* = (z_{i,0},\ldots,z_{i,\ell_i},1,\ldots,1)\) such that this path now has length \(\ell^*\). This remains a path in the transition graph of \((X_t)_{t \geq 0}\) because the chain is weakly lazy.

Now, for any pair of states \((i,j) \in [n] \times [n]\), consider the path of length \(m := 2 \ell^*\)

\[ \mathcal{Q}^*_{(i,j)} = ((z_{i,0},j),\ldots,(z_{i,\ell_i},j),(1,j),\ldots,(1,j),(1,z_{j,0}),\ldots,(1,z_{j,\ell_j}),(1,1),\ldots,(1,1)). \]

In words, we leave the second component at \(j\) while running through path \(\mathcal{P}_i^*\) on the first component, then we leave the first component at \(1\) while running through path \(\mathcal{P}_j^*\) on the second component. Path \(\mathcal{Q}^*_{(i,j)}\) is a valid path in the transition graph of the joint chain. Again, we are using that the marginal processes are weakly lazy.

Denote by \(\mathcal{Q}^*_{(i,j)}[r]\) the \(r\)-th state in path \(\mathcal{Q}^*_{(i,j)}\). Going back to bounding \(\P[(X_m, Y_m) = (1,1)]\), we define \(\beta\) as follows

\[\begin{align*} &\P[(X_m, Y_m) = (1,1)\,|\,(X_0, Y_0) = (i,j)]\\ &\geq \min_{(i_0,j_0)} \P[(X_r, Y_r) = \mathcal{Q}^*_{(i_0,j_0)}[r], \forall r=0,\ldots,m \,|\,(X_0, Y_0) = (i_0,j_0)]\\ &=: 1-\beta \in (0,1). \end{align*}\]

By the law of total probability,

\[\begin{align*} &\P[(X_m, Y_m) = (1,1)]\\ &= \sum_{(i,j) \in [n]\times [n]}\P[(X_m, Y_m) = (1,1)\,|\,(X_0, Y_0) = (i,j)]\, \P[(X_0, Y_0) = (i,j)]\\ &= \sum_{(i,j) \in [n]\times [n]}\P[(X_m, Y_m) = (1,1)\,|\,(X_0, Y_0) = (i,j)]\, \mu_i \pi_j\\ &\geq \sum_{(i,j) \in [n]\times [n]} \P[(X_r, Y_r) = \mathcal{Q}^*_{(i,j)}[r], \forall r=0,\ldots,m \,|\,(X_0, Y_0) = (i,j)]\, \mu_i \pi_j\\ &\geq (1-\beta) \sum_{(i,j) \in [n]\times [n]} \mu_i \pi_j\\ &= 1-\beta. \end{align*}\]

So we have

\[ \P[T > m] \leq \beta. \]

Because

\[ \{T > 2m\} \subseteq \{T > m\}, \]

it follows by the multiplication rule that

\[\begin{align*} \P[T > 2m] &= \P[\{T > 2m\}\cap\{T > m\}]\\ &= \P[T > 2m\,|\, T > m]\,\P[T > m]\\ &\leq \P[T > 2m\,|\, T > m]\,\beta. \end{align*}\]

Summing over all state pairs at time \(m\), we get

\[\begin{align*} &\P[T > 2m\,|\, T > m]\\ &= \sum_{(i,j) \in [n]\times [n]} \P[T > 2m\,|\, T > m, (X_m,Y_m) = (i,j)] \,\P[(X_m,Y_m) = (i,j) \,|\, T > m]. \end{align*}\]

Arguing as above, note that for \(i \neq j\)

\[\begin{align*} &\P[T \leq 2m\,|\, T > m, (X_m,Y_m) = (i,j)]\\ &\geq \P[(X_{2m},Y_{2m}) = (1,1) \,|\, T > m, (X_m,Y_m) = (i,j)]\\ &= \P[(X_{2m},Y_{2m}) = (1,1) \,|\, (X_m,Y_m) = (i,j), ((X_0,Y_0),\ldots,(X_{m-1},Y_{m-1})) \in \mathcal{N}^2_{m-1}]\\ &= \P[(X_{2m},Y_{2m}) = (1,1) \,|\, (X_m,Y_m) = (i,j)]\\ &= \P[(X_{m},Y_{m}) = (1,1) \,|\, (X_0,Y_0) = (i,j)]\\ &\geq 1-\beta, \end{align*}\]

by the Markov property and the time-homogeneity of the process.

Plugging above, and noting that \(\P[(X_m,Y_m) = (i,j) \,|\, T > m] = 0\) when \(i = j\) while \(\P[T > 2m\,|\, T > m, (X_m,Y_m) = (i,j)] \leq \beta\) when \(i \neq j\), we get that

\[\begin{align*} &\P[T > 2m\,|\, T > m]\\ &\leq \sum_{\substack{(i,j) \in [n]\times [n]\\i \neq j}} \beta \,\P[(X_m,Y_m) = (i,j) \,|\, T > m]\\ &= \beta. \end{align*}\]

So we have proved that \(\P[T > 2m] \leq \beta^2\). Proceeding similarly by induction gives the claim. \(\square\)
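
The coupling argument can also be illustrated by simulation. The sketch below (same hypothetical lazy walk as in the earlier sketches) runs two independent copies of the chain, one started at vertex \(1\) and one started from \(\bpi\), records the first meeting time \(T\), and estimates \(\P[T > t]\), which indeed decays quickly in \(t\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Lazy random walk on a small weighted graph (hypothetical example)
A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 1.]])
deg = A.sum(axis=1)
P = A / deg[:, None]
pi = deg / deg.sum()

def coupling_time(max_steps=10_000):
    """First time two independent copies of the walk (one from vertex 1, one from pi) meet."""
    x, y = 0, rng.choice(3, p=pi)
    if x == y:
        return 0
    for t in range(1, max_steps + 1):
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
        if x == y:
            return t
    return max_steps

samples = np.array([coupling_time() for _ in range(2000)])
for t in [1, 2, 4, 8]:
    print(t, np.mean(samples > t))          # empirical estimate of P[T > t]
```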