7.8. Online supplementary material#

7.8.1. Quizzes, solutions, code, etc.#

7.8.1.1. Just the code#

An interactive Jupyter notebook featuring the code in this chapter can be accessed below (Google Colab recommended). You are encouraged to tinker with it. Some suggested computational exercises are scattered throughout. The notebook is also available as a slideshow.

7.8.1.2. Self-assessment quizzes#

A more extensive web version of the self-assessment quizzes is available by following the links below.

7.8.1.3. Auto-quizzes#

Automatically generated quizzes for this chapter can be accessed here (Google Colab recommended).

7.8.1.4. Solutions to odd-numbered warm-up exercises#

(with help from Claude, Gemini, and ChatGPT)

E7.2.1 We need to check that all entries of $P$ are non-negative and that all rows sum to one.

Non-negativity: All entries of $P$ are clearly non-negative.

Row sums: $0.2 + 0.5 + 0.3 = 1$; $0.4 + 0.1 + 0.5 = 1$; $0.6 + 0.3 + 0.1 = 1$.

Therefore, $P$ is a stochastic matrix.
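As a quick numerical check (the matrix is the one from the exercise; the snippet itself is our own):

import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.6, 0.3, 0.1]])

print(np.all(P >= 0))   # non-negativity: True
print(P @ np.ones(3))   # row sums: [1. 1. 1.]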

E7.2.3

$$\mathbb{P}[X_2 = 2] = (\mu P^2)_2 = \left[(0.2,\ 0.3,\ 0.5)\begin{pmatrix}0.44 & 0.31 & 0.25\\ 0.44 & 0.37 & 0.19\\ 0.36 & 0.33 & 0.31\end{pmatrix}\right]_2 = 0.2\cdot 0.31 + 0.3\cdot 0.37 + 0.5\cdot 0.33 = 0.338.$$
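The same computation in NumPy (a quick check of our own, with $P^2$ as given above):

import numpy as np

mu = np.array([0.2, 0.3, 0.5])
P2 = np.array([[0.44, 0.31, 0.25],
               [0.44, 0.37, 0.19],
               [0.36, 0.33, 0.31]])

print((mu @ P2)[1])   # state 2 is index 1 in 0-indexing: 0.338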

E7.2.5

G = nx.DiGraph()
G.add_nodes_from(range(1, 5))
G.add_edges_from([(1, 2), (1, 4), (2, 1), (2, 3), (3, 2), (3, 4), (4, 1), (4, 3)])

The transition graph has a directed edge from state $i$ to state $j$ if and only if $p_{i,j} > 0$, as stated in the definition.

E7.2.7

$$\mathbb{P}[X_2 = 2 \mid X_0 = 3] = (P^2)_{3,2} = \left(\begin{pmatrix}0.1 & 0.4 & 0.5\\ 0.2 & 0.6 & 0.2\\ 0.3 & 0.3 & 0.4\end{pmatrix}^2\right)_{3,2} = \sum_{k=1}^3 p_{3,k}\,p_{k,2} = 0.3\cdot 0.4 + 0.3\cdot 0.6 + 0.4\cdot 0.3 = 0.42.$$
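This can be confirmed by squaring $P$ numerically (a sketch of our own):

import numpy as np

P = np.array([[0.1, 0.4, 0.5],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

P2 = np.linalg.matrix_power(P, 2)
print(P2[2, 1])   # (P^2)_{3,2} in 0-indexing: 0.42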

E7.2.9 The probability is given by the entry $(P^2)_{0,1}$, where $P^2$ is the matrix product $P \times P$.

$$P^2 = \begin{pmatrix}1/3 & 2/3\\ 1/2 & 1/2\end{pmatrix}\begin{pmatrix}1/3 & 2/3\\ 1/2 & 1/2\end{pmatrix} = \begin{pmatrix}4/9 & 5/9\\ 5/12 & 7/12\end{pmatrix}$$

Therefore, the probability is $(P^2)_{0,1} = 5/9$.

E7.2.11 The marginal distribution at time 2 is given by $\mu P^2$.

$$P^2 = \begin{pmatrix}1/2 & 0 & 1/2\\ 0 & 1 & 0\\ 1/3 & 1/3 & 1/3\end{pmatrix}\begin{pmatrix}1/2 & 0 & 1/2\\ 0 & 1 & 0\\ 1/3 & 1/3 & 1/3\end{pmatrix} = \begin{pmatrix}5/12 & 1/6 & 5/12\\ 0 & 1 & 0\\ 5/18 & 4/9 & 5/18\end{pmatrix}$$

Thus,

$$\mu P^2 = (1/4,\ 1/2,\ 1/4)\begin{pmatrix}5/12 & 1/6 & 5/12\\ 0 & 1 & 0\\ 5/18 & 4/9 & 5/18\end{pmatrix} = \left(\tfrac{25}{144},\ \tfrac{47}{72},\ \tfrac{25}{144}\right).$$
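Again, this is easy to confirm numerically (our own check):

import numpy as np

mu = np.array([1/4, 1/2, 1/4])
P = np.array([[1/2, 0, 1/2],
              [0, 1, 0],
              [1/3, 1/3, 1/3]])

print(mu @ np.linalg.matrix_power(P, 2))   # [25/144, 47/72, 25/144]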

E7.2.13 The distribution at time 1 is given by $\mu P$.

$$\mu P = (1/3,\ 2/3)\begin{pmatrix}1/2 & 1/2\\ 1 & 0\end{pmatrix} = \left(\tfrac13\cdot\tfrac12 + \tfrac23\cdot 1,\ \tfrac13\cdot\tfrac12 + \tfrac23\cdot 0\right) = \left(\tfrac56,\ \tfrac16\right)$$

E7.2.15 The chain alternates deterministically between states 1 and 2. Therefore, starting in state 1, it will return to state 1 after exactly 2 steps.

E7.3.1 Yes, the matrix is irreducible. The corresponding transition graph is a cycle, which is strongly connected.

E7.3.3 We need to check whether $\pi P = \pi$. Indeed,

$$(0.6,\ 0.4)\begin{pmatrix}0.4 & 0.6\\ 0.7 & 0.3\end{pmatrix} = (0.6\cdot 0.4 + 0.4\cdot 0.7,\ 0.6\cdot 0.6 + 0.4\cdot 0.3) = (0.52,\ 0.48) \neq (0.6,\ 0.4),$$

so $\pi$ is not a stationary distribution of the Markov chain.

E7.3.5 Let $\pi = (\pi_1, \pi_2)$ be a stationary distribution. Then we need to solve the system of equations:

$$\pi_1\cdot 0.5 + \pi_2\cdot 0.5 = \pi_1, \qquad \pi_1 + \pi_2 = 1.$$

The first equation simplifies to $\pi_1 = \pi_2$, and substituting this into the second equation yields $2\pi_1 = 1$, or $\pi_1 = \pi_2 = 0.5$. Therefore, $\pi = (0.5, 0.5)$ is a stationary distribution.

E7.3.7 To verify that $\pi$ is a stationary distribution, we need to check whether $\pi P = \pi$. Let's perform the matrix multiplication step by step:

$$\begin{aligned}
\pi P &= \left(\tfrac13, \tfrac13, \tfrac13\right)\begin{pmatrix}0.4 & 0.3 & 0.3\\ 0.2 & 0.5 & 0.3\\ 0.4 & 0.2 & 0.4\end{pmatrix}\\
&= \left(\tfrac13\cdot 0.4 + \tfrac13\cdot 0.2 + \tfrac13\cdot 0.4,\ \tfrac13\cdot 0.3 + \tfrac13\cdot 0.5 + \tfrac13\cdot 0.2,\ \tfrac13\cdot 0.3 + \tfrac13\cdot 0.3 + \tfrac13\cdot 0.4\right)\\
&= \left(\tfrac{0.4+0.2+0.4}{3},\ \tfrac{0.3+0.5+0.2}{3},\ \tfrac{0.3+0.3+0.4}{3}\right)\\
&= \left(\tfrac13, \tfrac13, \tfrac13\right)\\
&= \pi.
\end{aligned}$$

The result is equal to $\pi = \left(\tfrac13, \tfrac13, \tfrac13\right)$. Therefore, the uniform distribution $\pi$ is indeed a stationary distribution for the matrix $P$. Note that the transition matrix $P$ is doubly stochastic because each row and each column sums to 1. This property ensures that the uniform distribution is always a stationary distribution for doubly stochastic matrices, as mentioned in the text.

E7.3.9 Let $\mathbf{1} = (1, 1, \ldots, 1)^T$ be the column vector of all ones. For any stochastic matrix $P$, we have:

$$P\mathbf{1} = \begin{pmatrix}p_{1,1} & \cdots & p_{1,n}\\ \vdots & & \vdots\\ p_{n,1} & \cdots & p_{n,n}\end{pmatrix}\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix} = \begin{pmatrix}\sum_{j=1}^n p_{1,j}\\ \vdots\\ \sum_{j=1}^n p_{n,j}\end{pmatrix} = \begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix} = \mathbf{1},$$

because each row of a stochastic matrix sums to 1. Therefore, $\mathbf{1}$ is a right eigenvector of $P$ with eigenvalue 1.

E7.3.11 Let $\pi = (\pi_1, \pi_2, \pi_3)$ be the stationary distribution. We need to solve the following system of equations.

Step 1: Write out the system of equations based on $\pi P = \pi$.

$$\pi_1\cdot 0.7 + \pi_2\cdot 0.4 + \pi_3\cdot 0.6 = \pi_1, \qquad \pi_1\cdot 0.2 + \pi_2\cdot 0.4 + \pi_3\cdot 0.1 = \pi_2, \qquad \pi_1\cdot 0.1 + \pi_2\cdot 0.2 + \pi_3\cdot 0.3 = \pi_3.$$

Step 2: Rearrange the equations to have zero on the right-hand side.

$$\pi_1\cdot 0.7 + \pi_2\cdot 0.4 + \pi_3\cdot 0.6 - \pi_1 = 0, \qquad \pi_1\cdot 0.2 + \pi_2\cdot 0.4 + \pi_3\cdot 0.1 - \pi_2 = 0, \qquad \pi_1\cdot 0.1 + \pi_2\cdot 0.2 + \pi_3\cdot 0.3 - \pi_3 = 0.$$

Step 3: Simplify the equations.

$$-0.3\pi_1 + 0.4\pi_2 + 0.6\pi_3 = 0, \qquad 0.2\pi_1 - 0.6\pi_2 + 0.1\pi_3 = 0, \qquad 0.1\pi_1 + 0.2\pi_2 - 0.7\pi_3 = 0.$$

Step 4: Use the condition $\pi_1 + \pi_2 + \pi_3 = 1$ to eliminate one variable, say $\pi_3$.

$$\pi_3 = 1 - \pi_1 - \pi_2$$

Step 5: Substitute the expression for $\pi_3$ into the simplified equations from Step 3.

$$-0.3\pi_1 + 0.4\pi_2 + 0.6(1 - \pi_1 - \pi_2) = 0, \qquad 0.2\pi_1 - 0.6\pi_2 + 0.1(1 - \pi_1 - \pi_2) = 0.$$

Step 6: Simplify the equations further.

$$0.9\pi_1 + 0.2\pi_2 = 0.6, \qquad 0.1\pi_1 - 0.7\pi_2 = -0.1.$$

Step 7: Solve the linear system of two equations with two unknowns using any suitable method (e.g., substitution, elimination, or matrix inversion).

Using the substitution method: From the second equation, express $\pi_1$ in terms of $\pi_2$:

$$\pi_1 = 7\pi_2 - 1$$

Substitute this into the first equation:

$$0.9(7\pi_2 - 1) + 0.2\pi_2 = 0.6 \;\Longrightarrow\; 6.3\pi_2 - 0.9 + 0.2\pi_2 = 0.6 \;\Longrightarrow\; 6.5\pi_2 = 1.5 \;\Longrightarrow\; \pi_2 = \tfrac{3}{13}.$$

Now, substitute $\pi_2 = \tfrac{3}{13}$ back into the expression for $\pi_1$:

$$\pi_1 = 7\cdot\tfrac{3}{13} - 1 = \tfrac{21}{13} - \tfrac{13}{13} = \tfrac{8}{13}.$$

Finally, use $\pi_1 + \pi_2 + \pi_3 = 1$ to find $\pi_3$:

$$\pi_3 = 1 - \pi_1 - \pi_2 = 1 - \tfrac{8}{13} - \tfrac{3}{13} = \tfrac{2}{13}.$$

Therefore, the stationary distribution is $\pi = \left(\tfrac{8}{13}, \tfrac{3}{13}, \tfrac{2}{13}\right)$.
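The same answer can be obtained numerically with the "Replace an Equation" method mentioned in the learning outcomes (a sketch of our own):

import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.4, 0.2],
              [0.6, 0.1, 0.3]])

# stationary equations (P^T - I) pi = 0; replace the last equation
# with the normalization sum(pi) = 1
A = P.T - np.eye(3)
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 1.0])
print(np.linalg.solve(A, b))   # [8/13, 3/13, 2/13] ≈ [0.6154, 0.2308, 0.1538]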

E7.4.1 The chain has transition matrix

$$P = \begin{pmatrix}0.2 & 0.8\\ 0.6 & 0.4\end{pmatrix}.$$

Both diagonal entries are strictly positive ($p_{1,1} = 0.2$ and $p_{2,2} = 0.4$), so the chain satisfies the weak laziness condition $p_{i,i} > 0$ for all $i$ used in this chapter; under the stronger convention requiring $p_{i,i} \ge 1/2$, it would not be lazy.

E7.4.3 The chain has transition matrix

$$P = \begin{pmatrix}0.4 & 0.6\\ 0.7 & 0.3\end{pmatrix}$$

and initial distribution $\mu = (0.2, 0.8)$. All entries of $P$ are positive, so the chain is irreducible and aperiodic, and $\mu P^t$ converges to the unique stationary distribution $\pi$ regardless of $\mu$. Solving $\pi P = \pi$ with $\pi_1 + \pi_2 = 1$ gives $0.6\pi_1 = 0.7\pi_2$, hence

$$\lim_{t\to\infty}\mu P^t = \pi = \left(\tfrac{7}{13},\ \tfrac{6}{13}\right).$$
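The convergence can be observed by iterating the chain numerically (our own sketch):

import numpy as np

mu = np.array([0.2, 0.8])
P = np.array([[0.4, 0.6],
              [0.7, 0.3]])

# mu P^t approaches pi = (7/13, 6/13) as t grows
print(mu @ np.linalg.matrix_power(P, 100))   # [0.53846154 0.46153846]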

E7.5.1 The degree matrix $D$ is

$$D = \begin{pmatrix}2 & 0 & 0 & 0\\ 0 & 3 & 0 & 0\\ 0 & 0 & 3 & 0\\ 0 & 0 & 0 & 2\end{pmatrix}.$$

The degrees are computed by summing the entries in each row of $A$.

E7.5.3 A stochastic matrix has rows that sum to 1. The rows of $P$ sum to 1:

$$\tfrac12 + \tfrac12 = 1, \qquad \tfrac13 + \tfrac13 + \tfrac13 = 1, \qquad \tfrac13 + \tfrac13 + \tfrac13 = 1, \qquad \tfrac12 + \tfrac12 = 1.$$

Thus, $P$ is stochastic.

E7.5.5

$$P = D^{-1}A = \begin{pmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1/2 & 0\\ 0 & 0 & 0 & 1\end{pmatrix}\begin{pmatrix}0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 1\\ 0 & 0 & 1 & 0\end{pmatrix} = \begin{pmatrix}0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1/2 & 0 & 0 & 1/2\\ 0 & 0 & 1 & 0\end{pmatrix}.$$

E7.5.7 First, compute the transition matrix:

$$P = D^{-1}A = \begin{pmatrix}1 & 0 & 0 & 0\\ 0 & 1/2 & 0 & 0\\ 0 & 0 & 1/2 & 0\\ 0 & 0 & 0 & 1\end{pmatrix}\begin{pmatrix}0 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1\\ 0 & 0 & 1 & 0\end{pmatrix} = \begin{pmatrix}0 & 1 & 0 & 0\\ 1/2 & 0 & 1/2 & 0\\ 0 & 1/2 & 0 & 1/2\\ 0 & 0 & 1 & 0\end{pmatrix}.$$

Then, compute the modified transition matrix with the damping factor:

$$Q = \alpha P + (1-\alpha)\tfrac14\mathbf{1}\mathbf{1}^T = 0.8\,P + 0.2\begin{pmatrix}1/4 & 1/4 & 1/4 & 1/4\\ 1/4 & 1/4 & 1/4 & 1/4\\ 1/4 & 1/4 & 1/4 & 1/4\\ 1/4 & 1/4 & 1/4 & 1/4\end{pmatrix}.$$
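Both steps can be carried out in NumPy (a sketch of our own, following the formula above):

import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
n = A.shape[0]
P = np.diag(1 / (A @ np.ones(n))) @ A   # P = D^{-1} A

alpha = 0.8
Q = alpha * P + (1 - alpha) * np.ones((n, n)) / n
print(Q)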

E7.5.9 First, compute the transition matrix:

$$P = \begin{pmatrix}0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 1/2 & 0 & 0 & 1/2 & 0\\ 1/2 & 0 & 0 & 0 & 1/2\\ 0 & 0 & 0 & 1 & 0\end{pmatrix}.$$

Then, compute the modified transition matrix with the damping factor:

$$Q = 0.9\,P + 0.1\begin{pmatrix}1/5 & 1/5 & 1/5 & 1/5 & 1/5\\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5\\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5\\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5\\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5\end{pmatrix}.$$

E7.5.11 The new adjacency matrix $A$ is

$$A = \begin{pmatrix}0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 1\\ 0 & 0 & 0 & 1\end{pmatrix}.$$

A self-loop is added to vertex 4.

E7.5.13 The modified transition matrix $Q$ is

$$Q = \alpha P + (1-\alpha)\tfrac1n\mathbf{1}\mathbf{1}^T = 0.85\,P + 0.15\cdot\tfrac14\mathbf{1}\mathbf{1}^T.$$

Using the values from $P$ and $\mathbf{1}\mathbf{1}^T$, we get:

$$Q = \begin{pmatrix}0.0375 & 0.8875 & 0.0375 & 0.0375\\ 0.0375 & 0.0375 & 0.8875 & 0.0375\\ 0.4625 & 0.0375 & 0.0375 & 0.4625\\ 0.0375 & 0.0375 & 0.0375 & 0.8875\end{pmatrix}.$$

E7.6.1 The acceptance probability is given by

$$\min\left\{1,\ \frac{\pi_1}{\pi_2}\frac{Q(1,2)}{Q(2,1)}\right\} = \min\left\{1,\ \frac{0.1}{0.2}\cdot\frac{0.5}{0.5}\right\} = \frac12.$$

E7.6.3 Since the proposal chain is symmetric, the acceptance probability simplifies to $\min\left\{1, \frac{\pi_2}{\pi_1}\right\} = \min\left\{1, \frac{0.2}{0.1}\right\} = 1.$

E7.6.5 The acceptance probability is given by

$$\min\left\{1,\ \frac{\pi(y)}{\pi(x)}\frac{Q(y,x)}{Q(x,y)}\right\} = \min\left\{1,\ \frac{e^{-9/2}}{e^{-2}}\right\} = e^{-5/2}.$$
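For illustration, here is a minimal helper implementing the general acceptance formula, with the target and proposal densities passed in as functions. The specific target $\pi(z) \propto e^{-z^2/2}$ and the symmetric proposal are our assumptions, chosen to match the numbers above:

import numpy as np

def mh_acceptance(x, y, pi, Q):
    # Metropolis-Hastings acceptance probability for a proposed move x -> y
    return min(1.0, (pi(y) * Q(y, x)) / (pi(x) * Q(x, y)))

pi = lambda z: np.exp(-z**2 / 2)   # assumed target, up to normalization
Q = lambda a, b: 1.0               # symmetric proposal: the ratio cancels
print(mh_acceptance(2.0, 3.0, pi, Q))   # e^{-5/2} ≈ 0.0821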

E7.6.7 The conditional probability is given by

$$\pi_1^v(1 \mid v_{-1}, h) = \sigma\left(\sum_{j=1}^2 w_{1,j}h_j + b_1\right) = \sigma(1\cdot 1 + (-1)\cdot 0 + 1) = \sigma(2) \approx 0.881.$$

E7.6.9 The energy is given by

$$E(v,h) = -v^T W h = -\begin{pmatrix}1 & 0\end{pmatrix}\begin{pmatrix}1 & -2\\ 3 & 0\end{pmatrix}\begin{pmatrix}1\\ 1\end{pmatrix} = -(1\cdot 1 + (-2)\cdot 1) = 1.$$

E7.6.11 The conditional mean for the hidden units is:

$$\mathbb{E}[h_j \mid v] = \sigma\left(\sum_i W_{ij}v_i + c_j\right).$$

For $h_1$:

$$\mathbb{E}[h_1 \mid v] = \sigma(0.5\cdot 1 + 0.3\cdot 0 - 0.6\cdot 1 + 0.1) = \sigma(0.5 - 0.6 + 0.1) = \sigma(0) = 0.5.$$

For $h_2$:

$$\mathbb{E}[h_2 \mid v] = \sigma(-0.2\cdot 1 + 0.8\cdot 0 + 0.1\cdot 1 - 0.3) = \sigma(-0.2 + 0.1 - 0.3) = \sigma(-0.4) = \frac{1}{1+e^{0.4}} \approx 0.4013.$$
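These conditional means can be computed in one line with a sigmoid helper (the weight matrix and biases below are read off from the computation above; the exact layout is our assumption):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W = np.array([[0.5, -0.2],
              [0.3, 0.8],
              [-0.6, 0.1]])   # W[i, j]: visible unit i, hidden unit j
v = np.array([1, 0, 1])
c = np.array([0.1, -0.3])

print(sigmoid(W.T @ v + c))   # [0.5, 0.4013]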

7.8.1.5. Learning outcomes#

  • Define a discrete-time Markov chain and its state space, initial distribution, and transition probabilities.

  • Identify Markov chains in real-world scenarios, such as weather patterns or random walks on graphs.

  • Construct the transition matrix for a time-homogeneous Markov chain.

  • Construct the transition graph of a Markov chain from its transition matrix.

  • Prove that the transition matrix of a Markov chain is a stochastic matrix.

  • Apply the Markov property to calculate the probability of specific sequences of events in a Markov chain using the distribution of a sample path.

  • Compute the marginal distribution of a Markov chain at a specific time using matrix powers.

  • Use simulation to generate sample paths of a Markov chain.

  • Define a stationary distribution of a finite-state, discrete-time, time-homogeneous Markov chain and express the defining condition in matrix form.

  • Explain the relationship between stationary distributions and left eigenvectors of the transition matrix.

  • Determine whether a given probability distribution is a stationary distribution for a given Markov chain.

  • Identify irreducible Markov chains and explain their significance in the context of stationary distributions.

  • State the main theorem on the existence and uniqueness of a stationary distribution for an irreducible Markov chain.

  • Compute the stationary distribution of a Markov chain numerically using either eigenvalue methods or by solving a linear system of equations, including the “Replace an Equation” method.

  • Define the concepts of aperiodicity and weak laziness for finite-space discrete-time Markov chains.

  • State the Convergence to Equilibrium Theorem and the Ergodic Theorem for irreducible Markov chains.

  • Verify the Ergodic Theorem by simulating a Markov chain and comparing the frequency of visits to each state with the stationary distribution.

  • Explain the concept of coupling and its role in proving the Convergence to Equilibrium Theorem for irreducible, weakly lazy Markov chains.

  • Define random walks on directed and undirected graphs, and express their transition matrices in terms of the adjacency matrix.

  • Determine the stationary distribution of a random walk on an undirected graph and explain its relation to degree centrality.

  • Explain the concept of reversibility and its connection to the stationary distribution of a random walk.

  • Describe the PageRank algorithm and its interpretation as a modified random walk on a directed graph.

  • Compute the PageRank vector by finding the stationary distribution of the modified random walk using power iteration.

  • Apply the PageRank algorithm to real-world datasets to identify central nodes in a network.

  • Explain the concept of Personalized PageRank and how it differs from the standard PageRank algorithm.

  • Modify the PageRank algorithm to implement Personalized PageRank and interpret the results based on the chosen teleportation distribution.

  • Describe the concept of Markov Chain Monte Carlo (MCMC) and its application in sampling from complex probability distributions.

  • Explain the Metropolis-Hastings algorithm, including the proposal distribution and acceptance-rejection steps.

  • Calculate acceptance probabilities in the Metropolis-Hastings algorithm for a given target and proposal distribution.

  • Prove the correctness of the Metropolis-Hastings algorithm by showing that the resulting Markov chain is irreducible and reversible with respect to the target distribution.

  • Implement the Gibbs sampling algorithm for a given probabilistic model, such as a Restricted Boltzmann Machine (RBM).

  • Analyze the conditional independence properties of RBMs and their role in Gibbs sampling.

7.8.2. Additional sections#

7.8.2.1. Random walk on a weighted graph#

The previous definitions extend naturally to the weighted case. Again we allow loops, i.e., self-weights $w_{i,i} > 0$. For a weighted graph $G$, recall that the degree of a vertex is defined as

$$\delta(i) = \sum_j w_{i,j},$$

which includes the self-weight $w_{i,i}$, and where we use the convention that $w_{i,j} = 0$ if $\{i,j\} \notin E$. Recall also that $w_{i,j} = w_{j,i}$.

DEFINITION (Random Walk on a Weighted Graph) Let $G = (V, E, w)$ be a weighted graph with positive edge weights. Assume all vertices have a positive degree. A random walk on $G$ is a time-homogeneous Markov chain $(X_t)_{t\ge 0}$ with state space $S = V$ and transition probabilities

$$p_{i,j} = \mathbb{P}[X_{t+1} = j \mid X_t = i] = \frac{w_{i,j}}{\sum_k w_{i,k}}, \qquad \forall i, j \in V.$$

Once again, it is easily seen that the transition matrix of a random walk on $G$ satisfying the conditions of the definition above is $P = D^{-1}A$, where $D = \mathrm{diag}(A\mathbf{1})$ is the degree matrix.

EXAMPLE: (A Weighted Graph) Here is another example. Consider the following adjacency matrix on 5 vertices.

A_ex = np.array([
    [0, 8, 0, 1, 0],
    [8, 0, 6, 1, 0],
    [0, 6, 0, 0, 7],
    [1, 1, 0, 0, 10],
    [0, 0, 7, 10, 0]])

It is indeed a symmetric matrix.

print(LA.norm(A_ex.T - A_ex))
0.0

We define a graph from its adjacency matrix.

n_ex = A_ex.shape[0] # number of vertices
G_ex = nx.from_numpy_array(A_ex) # graph

To draw it, we first define edge labels by creating a dictionary that assigns to each edge (as a tuple) its weight. Here G.edges.data('weight') (see G.edges) iterates through the edges (u,v) and includes their weight as the third entry of the tuple (u,v,w). Then we use the function networkx.draw_networkx_edge_labels() to add the weights as edge labels.

edge_labels = {}
for (u,v,w) in G_ex.edges.data('weight'):
    edge_labels[(u,v)] = w
pos=nx.circular_layout(G_ex)
nx.draw_networkx(G_ex, pos, labels={i: i+1 for i in range(n_ex)},
                 node_color='black', font_color='white')
edge_labels = nx.draw_networkx_edge_labels(G_ex, pos, edge_labels=edge_labels)
plt.axis('off')
plt.show()
(Figure: the weighted graph drawn in a circular layout, with vertices 1 through 5 and edge weights shown as labels.)

The transition matrix of the random walk on this graph can be computed using the lemma above. We first compute the degree matrix, then apply the formula.

degrees_ex = A_ex @ np.ones(n_ex)
inv_degrees_ex = 1/ degrees_ex
Dinv_ex = np.diag(inv_degrees_ex)
P_ex = Dinv_ex @ A_ex
print(P_ex)
[[0.         0.88888889 0.         0.11111111 0.        ]
 [0.53333333 0.         0.4        0.06666667 0.        ]
 [0.         0.46153846 0.         0.         0.53846154]
 [0.08333333 0.08333333 0.         0.         0.83333333]
 [0.         0.         0.41176471 0.58823529 0.        ]]

This is indeed a stochastic matrix.

print(P_ex @ np.ones(n_ex))
[1. 1. 1. 1. 1.]

LEMMA (Irreducibility in Undirected Case) Let $G = (V, E, w)$ be a graph with positive edge weights. Assume all vertices have a positive degree. Random walk on $G$ is irreducible if and only if $G$ is connected.

THEOREM (Stationary Distribution on a Graph) Let $G = (V, E, w)$ be a graph with positive edge weights. Assume further that $G$ is connected. Then the unique stationary distribution of random walk on $G$ is given by

$$\pi_i = \frac{\delta(i)}{\sum_{j\in V}\delta(j)}, \qquad i \in V.$$

EXAMPLE: (A Weighted Graph, continued) Going back to our weighted graph example, we use the previous theorem to compute the stationary distribution. Note that the graph is indeed connected so the stationary distribution is unique. We have already computed the degrees.

print(degrees_ex)
[ 9. 15. 13. 12. 17.]

We compute $\sum_{i\in V}\delta(i)$ next.

total_degrees_ex = degrees_ex @ np.ones(n_ex)
print(total_degrees_ex)
66.0

Finally, the stationary distribution is:

pi_ex = degrees_ex / total_degrees_ex
print(pi_ex)
[0.13636364 0.22727273 0.1969697  0.18181818 0.25757576]

We check stationarity.

print(LA.norm(P_ex.T @ pi_ex - pi_ex))
2.7755575615628914e-17

A random walk on a weighted undirected graph is reversible. Conversely, it turns out that any reversible chain can be seen as a random walk on an appropriately defined weighted undirected graph. See the exercises.

7.8.2.2. Spectral techniques for random walks on graphs#

In this section, we use techniques from spectral graph theory to analyze random walks on graphs.

Applying the spectral theorem via the normalized Laplacian We have seen how to compute the unique stationary distribution $\pi$ of random walk on a connected weighted (undirected) graph. Recall that $\pi$ is a (left) eigenvector of $P$ with eigenvalue 1 (i.e., $\pi P = \pi$). In general, however, the matrix $P$ is not symmetric in this case (see the previous example), even though the adjacency matrix is. So we cannot apply the spectral theorem to get the rest of the eigenvectors, if they even exist. But, remarkably, a symmetric matrix with a closely related spectral decomposition is hiding in the background.

Recall that the normalized Laplacian of a weighted graph $G = (V, E, w)$ with adjacency matrix $A$ and degree matrix $D$ is defined as

$$L = I - D^{-1/2} A D^{-1/2}.$$

Recall that in the weighted case, the degree is defined as $\delta(i) = \sum_{j: \{i,j\}\in E} w_{i,j}$. Because $L$ is symmetric and positive semi-definite, we can write

$$L = \sum_{i=1}^n \eta_i z_i z_i^T,$$

where the $z_i$'s are orthonormal eigenvectors of $L$ and the eigenvalues satisfy $0 \le \eta_1 \le \eta_2 \le \cdots \le \eta_n$.

Moreover, $D^{1/2}\mathbf{1}$ is an eigenvector of $L$ with eigenvalue 0. So $\eta_1 = 0$ and we set

$$(z_1)_i = \left(\frac{D^{1/2}\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)_i = \sqrt{\frac{\delta(i)}{\sum_{j\in V}\delta(j)}}, \qquad i \in [n],$$

which makes $z_1$ into a unit norm vector.

We return to the eigenvectors of $P$. When a matrix $A \in \mathbb{R}^{n\times n}$ is diagonalizable, it has an eigendecomposition of the form

$$A = Q\Lambda Q^{-1},$$

where $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of $A$. The columns of $Q$ are the eigenvectors of $A$ and they form a basis of $\mathbb{R}^n$. Unlike the symmetric case, however, the eigenvectors need not be orthogonal.

THEOREM (Eigendecomposition of Random Walk on a Graph) Let $G = (V, E, w)$ be a graph with positive edge weights and no isolated vertex, and with degree matrix $D$. Let $P \in \mathbb{R}^{n\times n}$ be the transition matrix of random walk on $G$. Let $z_1, \ldots, z_n \in \mathbb{R}^n$ and $0 \le \eta_1 \le \cdots \le \eta_n \le 2$ be the eigenvectors and eigenvalues of the normalized Laplacian. Then $P$ has the following eigendecomposition

$$P = (D^{-1/2}Z)(I - H)(D^{-1/2}Z)^{-1}$$

where the columns of $Z$ are the $z_i$'s and $H$ is a diagonal matrix with the $\eta_i$'s on its diagonal. This can also be written as

$$P = \mathbf{1}\pi + \sum_{i=2}^n \lambda_i\, D^{-1/2} z_i z_i^T D^{1/2},$$

where

$$\pi_i = \frac{\delta(i)}{\sum_{j\in V}\delta(j)} \quad\text{and}\quad \lambda_i = 1 - \eta_i, \qquad i = 1,\ldots,n.$$

Proof: We write $L$ in terms of $P$. Recall that $P = D^{-1}A$. Hence

$$L = I - D^{1/2} P D^{-1/2}.$$

Rearranging, this becomes

$$P = I - D^{-1/2} L D^{1/2}.$$

Hence, for all $i = 1, \ldots, n$,

$$P(D^{-1/2}z_i) = (I - D^{-1/2}LD^{1/2})(D^{-1/2}z_i) = (D^{-1/2}z_i) - D^{-1/2}LD^{1/2}(D^{-1/2}z_i) = (D^{-1/2}z_i) - D^{-1/2}Lz_i = (D^{-1/2}z_i) - D^{-1/2}\eta_i z_i = (1-\eta_i)(D^{-1/2}z_i).$$

Because $P$ is a transition matrix, all its eigenvalues are bounded in absolute value by 1. So $|1-\eta_i| \le 1$, which implies $0 \le \eta_i \le 2$.

We also note that

$$(D^{-1/2}Z)(D^{1/2}Z)^T = D^{-1/2}ZZ^TD^{1/2} = D^{-1/2}D^{1/2} = I,$$

by the orthonormality of the eigenvectors of the normalized Laplacian. So the columns of $D^{-1/2}Z$, i.e., $D^{-1/2}z_i$ for $i = 1, \ldots, n$, are linearly independent and

$$(D^{-1/2}Z)^{-1} = (D^{1/2}Z)^T.$$

That gives the first claim.

To get the second claim, we first note that (for similar calculations, see the definition of an SVD)

$$P = (D^{-1/2}Z)(I-H)(D^{-1/2}Z)^{-1} = (D^{-1/2}Z)(I-H)(D^{1/2}Z)^T = D^{-1/2}Z(I-H)Z^TD^{1/2} = \sum_{i=1}^n \lambda_i\, D^{-1/2}z_i z_i^T D^{1/2},$$

where $\lambda_i = 1 - \eta_i$.

We then use the expression for $z_1$ above. We have

$$D^{-1/2}z_1 = D^{-1/2}\frac{D^{1/2}\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2} = \frac{\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2},$$

while

$$z_1^T D^{1/2} = (D^{1/2}z_1)^T = \left(D^{1/2}\frac{D^{1/2}\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)^T = \left(\frac{D\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)^T.$$

So

$$D^{-1/2}z_1 z_1^T D^{1/2} = \frac{\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\left(\frac{D\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2}\right)^T = \mathbf{1}\left(\frac{D\mathbf{1}}{\|D^{1/2}\mathbf{1}\|_2^2}\right)^T = \mathbf{1}\left(\frac{D\mathbf{1}}{\mathbf{1}^TD\mathbf{1}}\right)^T = \mathbf{1}\pi.$$

That proves the second claim.
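The theorem is easy to verify numerically. Here is a quick sanity check on a randomly generated weighted graph with self-loops (our own construction, not from the text): the eigenvalues of $P$ should equal the $1 - \eta_i$.

import numpy as np

rng = np.random.default_rng(0)
B = rng.random((5, 5))
A = (B + B.T) / 2 + np.eye(5)    # symmetric weights, positive self-weights
D = np.diag(A @ np.ones(5))

P = np.linalg.inv(D) @ A
L = np.eye(5) - np.linalg.inv(np.sqrt(D)) @ A @ np.linalg.inv(np.sqrt(D))

eta = np.linalg.eigvalsh(L)                 # eigenvalues of L (ascending)
lam = np.sort(np.linalg.eigvals(P).real)    # eigenvalues of P
print(np.allclose(np.sort(1 - eta), lam))   # True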

If $G$ is connected and $w_{i,i} > 0$ for all $i$, then the chain is irreducible and lazy. In that case, 1 is an eigenvalue of multiplicity one and $-1$ is not an eigenvalue, so we must have $0 < \eta_2 \le \cdots \le \eta_n < 2$.

Limit theorems revisited The Convergence to Equilibrium Theorem implies that in the irreducible, aperiodic case

$$\mu P^t \to \pi,$$

as $t \to +\infty$, for any initial distribution $\mu$ and the unique stationary distribution $\pi$. Here we give a simpler proof for random walk on a graph (or more generally a reversible chain), with the added bonus of a convergence rate. This follows from the same argument we used in the Power Iteration Lemma.

DEFINITION (Spectral Gap) Let $G = (V, E, w)$ be a graph with positive edge weights and no isolated vertex. Let $P \in \mathbb{R}^{n\times n}$ be the transition matrix of random walk on $G$. The absolute spectral gap of $G$ is defined as $\gamma = 1 - \lambda^*$ where

$$\lambda^* = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } P,\ \lambda \neq 1\}.$$

THEOREM (Convergence to Equilibrium: Reversible Case) Let $G = (V, E, w)$ be a connected graph with positive edge weights and $w_{x,x} > 0$ for all $x \in V$. Let $P \in \mathbb{R}^{n\times n}$ be the transition matrix of random walk on $G$ and $\pi$ its unique stationary distribution. Then

$$\mu P^t \to \pi,$$

as $t \to +\infty$ for any initial distribution $\mu$. Moreover,

$$\left|P^t_{x,y} - \pi_y\right| \le (1-\gamma)^t\,\sqrt{\frac{\bar\delta}{\underline\delta}},$$

where $\bar\delta = \max_x \delta(x)$, $\underline\delta = \min_x \delta(x)$, and $\gamma$ is the absolute spectral gap.

Proof: Similarly to the Power Iteration Lemma, using the Eigendecomposition of Random Walk on a Graph we get

$$P^2 = (D^{-1/2}Z)(I-H)(D^{-1/2}Z)^{-1}(D^{-1/2}Z)(I-H)(D^{-1/2}Z)^{-1} = (D^{-1/2}Z)(I-H)^2(D^{-1/2}Z)^{-1}.$$

By induction,

$$P^t = (D^{-1/2}Z)(I-H)^t(D^{-1/2}Z)^{-1} = \sum_{i=1}^n \lambda_i^t\,(D^{-1/2}z_i)(D^{1/2}z_i)^T = \mathbf{1}\pi + \sum_{i=2}^n \lambda_i^t\,(D^{-1/2}z_i)(D^{1/2}z_i)^T,$$

by calculations similar to the proof of the Eigendecomposition of Random Walk on a Graph.

In the irreducible, lazy case, for $i = 2, \ldots, n$, $\lambda_i^t \to 0$ as $t \to +\infty$.

Moreover, $|\lambda_i| \le 1-\gamma$ for all $i = 2, \ldots, n$. Hence,

$$\left|P^t_{x,y} - \pi_y\right| = \left|\sum_{i=2}^n \lambda_i^t\,\delta(x)^{-1/2}(z_i)_x(z_i)_y\,\delta(y)^{1/2}\right| \le (1-\gamma)^t\sqrt{\frac{\delta(y)}{\delta(x)}}\sum_{i=2}^n \left|(z_i)_x(z_i)_y\right|.$$

We then use Cauchy-Schwarz and the fact that $ZZ^T = I$ (as $Z$ is an orthogonal matrix), which implies $\sum_{i=1}^n (z_i)_x^2 = 1$.

We get that the above is

$$\le (1-\gamma)^t\sqrt{\frac{\delta(y)}{\delta(x)}}\sqrt{\sum_{i=2}^n (z_i)_x^2}\sqrt{\sum_{i=2}^n (z_i)_y^2} \le (1-\gamma)^t\sqrt{\frac{\delta(y)}{\delta(x)}} \le (1-\gamma)^t\sqrt{\frac{\bar\delta}{\underline\delta}}.$$
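As a numerical illustration (on a small weighted graph of our own choosing, with self-loops so the walk is lazy), the bound can be checked directly:

import numpy as np

A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 2.]])
deg = A @ np.ones(3)
P = np.diag(1/deg) @ A
pi = deg / deg.sum()

# absolute spectral gap: 1 minus the second largest |eigenvalue|
abs_eigs = np.sort(np.abs(np.linalg.eigvals(P).real))
gamma = 1 - abs_eigs[-2]

t = 10
Pt = np.linalg.matrix_power(P, t)
lhs = np.max(np.abs(Pt - pi))                          # max over x, y of |P^t_{x,y} - pi_y|
rhs = (1 - gamma)**t * np.sqrt(deg.max() / deg.min())  # the theorem's bound
print(lhs <= rhs)   # True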

We record an immediate corollary that will be useful next. Let $f: V \to \mathbb{R}$ be a function over the vertices. Define the (column) vector $\mathbf{f} = (f(1), \ldots, f(n))^T$ and note that

$$\pi\mathbf{f} = \sum_{x\in V}\pi_x f(x).$$

It will be convenient to use the $\infty$-norm. For a vector $x = (x_1, \ldots, x_n)^T$, we let $\|x\|_\infty = \max_{i\in[n]}|x_i|$.

THEOREM For any initial distribution $\mu$ and any $t$,

$$\left|\mathbb{E}[f(X_t)] - \pi\mathbf{f}\right| \le (1-\gamma)^t\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty,$$

where $\pi_{\min} = \min_{x\in V}\pi_x$.

Proof: By the Time Marginals Theorem,

$$\left|\mathbb{E}[f(X_t)] - \pi\mathbf{f}\right| = \left|\sum_x\sum_y \mu_x (P^t)_{x,y} f(y) - \sum_y \pi_y f(y)\right|.$$

Because $\sum_x \mu_x = 1$, the right-hand side is

$$= \left|\sum_x\sum_y \mu_x (P^t)_{x,y} f(y) - \sum_x\sum_y \mu_x \pi_y f(y)\right| \le \sum_x \mu_x \sum_y \left|(P^t)_{x,y} - \pi_y\right| |f(y)|,$$

by the triangle inequality.

Now, by the theorem, $|(P^t)_{x,y} - \pi_y| \le (1-\gamma)^t\sqrt{\pi_y/\pi_x} \le (1-\gamma)^t\,\pi_y/\pi_{\min}$, since $\delta(y)/\delta(x) = \pi_y/\pi_x$ and $\sqrt{\pi_y/\pi_x} \le \sqrt{\pi_y/\pi_{\min}} \le \pi_y/\pi_{\min}$ (the last ratio being at least 1). So the above is at most

$$\sum_x \mu_x \sum_y (1-\gamma)^t\frac{\pi_y}{\pi_{\min}}|f(y)| = (1-\gamma)^t\frac{1}{\pi_{\min}}\sum_x \mu_x \sum_y \pi_y |f(y)| \le (1-\gamma)^t\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty.$$

That proves the claim.

We also prove a version of the Ergodic Theorem.

THEOREM (Ergodic Theorem: Reversible Case) Let $G = (V, E, w)$ be a connected graph with positive edge weights and $w_{x,x} > 0$ for all $x \in V$. Let $P \in \mathbb{R}^{n\times n}$ be the transition matrix of random walk on $G$ and $\pi$ its unique stationary distribution. Let $f: V \to \mathbb{R}$ be a function over the vertices. Then, for any initial distribution $\mu$,

$$\frac{1}{T}\sum_{t=1}^T f(X_t) \to \sum_{x\in V}\pi_x f(x),$$

in probability as $T \to +\infty$.

Proof idea: We use Chebyshev's Inequality. By the Convergence to Equilibrium Theorem (Reversible Case), the expectation converges to the limit. To bound the variance, we use the Eigendecomposition of Random Walk on a Graph.

Proof: We use Chebyshev's Inequality, similarly to our proof of the Weak Law of Large Numbers. However, unlike that case, here the terms in the sum are not independent and require some finessing. Define again the (column) vector $\mathbf{f} = (f(1), \ldots, f(n))^T$. Then the limit can be written as

$$\sum_{x\in V}\pi_x f(x) = \pi\mathbf{f}.$$

By the corollary, the expectation of the sum can be bounded as follows:

$$\left|\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right] - \pi\mathbf{f}\right| \le \frac{1}{T}\sum_{t=1}^T\left|\mathbb{E}[f(X_t)] - \pi\mathbf{f}\right| \le \frac{1}{T}\sum_{t=1}^T (1-\gamma)^t\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty \le \pi_{\min}^{-1}\|\mathbf{f}\|_\infty\,\frac{1}{T}\sum_{t=0}^{+\infty}(1-\gamma)^t = \pi_{\min}^{-1}\|\mathbf{f}\|_\infty\,\gamma^{-1}\frac{1}{T} \to 0,$$

as $T \to +\infty$.

Next we bound the variance of the sum. By the Variance of a Sum,

$$\mathrm{Var}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right] = \frac{1}{T^2}\sum_{t=1}^T \mathrm{Var}[f(X_t)] + \frac{2}{T^2}\sum_{1\le s<t\le T}\mathrm{Cov}[f(X_s), f(X_t)].$$

We bound the variance and covariance separately using the Eigendecomposition of Random Walk on a Graph.

To obtain convergence, a trivial bound on the variance suffices. Indeed,

$$0 \le \mathrm{Var}[f(X_t)] \le \mathbb{E}[f(X_t)^2] \le \|\mathbf{f}\|_\infty^2.$$

Hence,

$$0 \le \frac{1}{T^2}\sum_{t=1}^T \mathrm{Var}[f(X_t)] \le \frac{T\|\mathbf{f}\|_\infty^2}{T^2} \to 0,$$

as $T \to +\infty$.

Bounding the covariance requires a more delicate argument. Fix $1 \le s < t \le T$. The trick is to condition on $X_s$ and use the Markov Property. By the definition of the covariance and the Law of Total Expectation,

$$\begin{aligned}
\mathrm{Cov}[f(X_s), f(X_t)] &= \mathbb{E}[(f(X_s) - \mathbb{E}[f(X_s)])(f(X_t) - \mathbb{E}[f(X_t)])]\\
&= \sum_x \mathbb{E}[(f(X_s) - \mathbb{E}[f(X_s)])(f(X_t) - \mathbb{E}[f(X_t)]) \mid X_s = x]\,\mathbb{P}[X_s = x]\\
&= \sum_x \mathbb{E}[f(X_t) - \mathbb{E}[f(X_t)] \mid X_s = x]\,(f(x) - \mathbb{E}[f(X_s)])\,\mathbb{P}[X_s = x].
\end{aligned}$$

We now use the time-homogeneity of the chain to note that

$$\mathbb{E}[f(X_t) - \mathbb{E}[f(X_t)] \mid X_s = x] = \mathbb{E}[f(X_t) \mid X_s = x] - \mathbb{E}[f(X_t)] = \mathbb{E}[f(X_{t-s}) \mid X_0 = x] - \mathbb{E}[f(X_t)].$$

We now use the corollary. This expression in absolute value is

$$\begin{aligned}
\left|\mathbb{E}[f(X_t) - \mathbb{E}[f(X_t)] \mid X_s = x]\right| &= \left|\mathbb{E}[f(X_{t-s}) \mid X_0 = x] - \mathbb{E}[f(X_t)]\right|\\
&= \left|\left(\mathbb{E}[f(X_{t-s}) \mid X_0 = x] - \pi\mathbf{f}\right) - \left(\mathbb{E}[f(X_t)] - \pi\mathbf{f}\right)\right|\\
&\le \left|\mathbb{E}[f(X_{t-s}) \mid X_0 = x] - \pi\mathbf{f}\right| + \left|\mathbb{E}[f(X_t)] - \pi\mathbf{f}\right|\\
&\le (1-\gamma)^{t-s}\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty + (1-\gamma)^t\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty.
\end{aligned}$$

Plugging back above,

$$\begin{aligned}
\left|\mathrm{Cov}[f(X_s), f(X_t)]\right| &\le \sum_x \left|\mathbb{E}[f(X_t) - \mathbb{E}[f(X_t)] \mid X_s = x]\right|\,\left|f(x) - \mathbb{E}[f(X_s)]\right|\,\mathbb{P}[X_s = x]\\
&\le \sum_x \left((1-\gamma)^{t-s}\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty + (1-\gamma)^t\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty\right)\left|f(x) - \mathbb{E}[f(X_s)]\right|\,\mathbb{P}[X_s = x]\\
&\le \left((1-\gamma)^{t-s}\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty + (1-\gamma)^t\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty\right)\sum_x 2\|\mathbf{f}\|_\infty\,\mathbb{P}[X_s = x]\\
&\le 4(1-\gamma)^{t-s}\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2,
\end{aligned}$$

where we used that $(1-\gamma)^t \le (1-\gamma)^{t-s}$ since $1-\gamma < 1$ and $t-s \le t$.

Returning to the sum over the covariances, the previous bound gives

$$\left|\frac{2}{T^2}\sum_{1\le s<t\le T}\mathrm{Cov}[f(X_s), f(X_t)]\right| \le \frac{2}{T^2}\sum_{1\le s<t\le T}\left|\mathrm{Cov}[f(X_s), f(X_t)]\right| \le \frac{2}{T^2}\sum_{1\le s<t\le T} 4(1-\gamma)^{t-s}\,\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2.$$

To evaluate the sum, we make the change of variable $h = t - s$ to get that the previous expression is

$$4\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\frac{2}{T^2}\sum_{1\le s\le T}\sum_{h=1}^{T-s}(1-\gamma)^h \le 4\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\frac{2}{T^2}\sum_{1\le s\le T}\sum_{h=0}^{+\infty}(1-\gamma)^h = 4\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\frac{2}{T^2}\sum_{1\le s\le T}\frac{1}{\gamma} = 8\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\gamma^{-1}\frac{1}{T} \to 0,$$

as $T \to +\infty$.

We have shown that

$$\mathrm{Var}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right] \le \|\mathbf{f}\|_\infty^2\,\frac{1}{T} + 8\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\gamma^{-1}\frac{1}{T} \le 9\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\gamma^{-1}\frac{1}{T}.$$

For any $\varepsilon > 0$,

$$\begin{aligned}
&\mathbb{P}\left[\left|\frac{1}{T}\sum_{t=1}^T f(X_t) - \pi\mathbf{f}\right| \ge \varepsilon\right]\\
&= \mathbb{P}\left[\left|\frac{1}{T}\sum_{t=1}^T f(X_t) - \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right] + \left(\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right] - \pi\mathbf{f}\right)\right| \ge \varepsilon\right]\\
&\le \mathbb{P}\left[\left|\frac{1}{T}\sum_{t=1}^T f(X_t) - \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right]\right| + \left|\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right] - \pi\mathbf{f}\right| \ge \varepsilon\right]\\
&\le \mathbb{P}\left[\left|\frac{1}{T}\sum_{t=1}^T f(X_t) - \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(X_t)\right]\right| \ge \varepsilon - \pi_{\min}^{-1}\|\mathbf{f}\|_\infty\,\gamma^{-1}\frac{1}{T}\right].
\end{aligned}$$

We can now apply Chebyshev to get

$$\mathbb{P}\left[\left|\frac{1}{T}\sum_{t=1}^T f(X_t) - \pi\mathbf{f}\right| \ge \varepsilon\right] \le \frac{9\pi_{\min}^{-1}\|\mathbf{f}\|_\infty^2\,\gamma^{-1}\frac{1}{T}}{\left(\varepsilon - \pi_{\min}^{-1}\|\mathbf{f}\|_\infty\,\gamma^{-1}\frac{1}{T}\right)^2} \to 0,$$

as $T \to +\infty$.
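The Ergodic Theorem can also be observed in simulation (a sketch of our own, reusing a small lazy weighted graph): the running average of $f$ along a sample path approaches $\pi\mathbf{f}$.

import numpy as np

rng = np.random.default_rng(535)

A = np.array([[1., 2., 1.],
              [2., 1., 3.],
              [1., 3., 2.]])
deg = A @ np.ones(3)
P = np.diag(1/deg) @ A
pi = deg / deg.sum()
f = np.array([1.0, -2.0, 5.0])   # an arbitrary function on the vertices

T = 100000
x = 0
running_sum = 0.0
for _ in range(T):
    x = rng.choice(3, p=P[x])
    running_sum += f[x]

print(running_sum / T)   # close to...
print(pi @ f)            # ...the stationary average 1.375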

7.8.3. Additional proofs#

Proof of the convergence theorem We prove the convergence to equilibrium theorem in the irreducible, lazy case. Throughout this section, $(X_t)_{t\ge0}$ is an irreducible, lazy Markov chain on state space $S = [n]$ with transition matrix $P = (p_{i,j})_{i,j=1}^n$, initial distribution $\mu = (\mu_1, \ldots, \mu_n)$ and unique stationary distribution $\pi = (\pi_1, \ldots, \pi_n)$.

We give a probabilistic proof of the Convergence to Equilibrium Theorem. The proof uses a clever idea: coupling. Separately from $(X_t)_{t\ge0}$, we consider an independent Markov chain $(Y_t)_{t\ge0}$ with the same transition matrix but initial distribution $\pi$. By the definition of stationarity,

$$\mathbb{P}[Y_t = i] = \pi_i,$$

for all $i$ and all $t$. Hence it suffices to show that

$$\left|\mathbb{P}[X_t = i] - \pi_i\right| = \left|\mathbb{P}[X_t = i] - \mathbb{P}[Y_t = i]\right| \to 0,$$

as $t \to +\infty$, for all $i$.

Step 1: Showing the joint process is Markov. We observe first that the joint process $(X_0, Y_0), (X_1, Y_1), \ldots$ is itself a Markov chain! Let's just check the definition. By the independence of $(X_t)$ and $(Y_t)$,

$$\begin{aligned}
&\mathbb{P}[(X_t,Y_t)=(x_t,y_t) \mid (X_{t-1},Y_{t-1})=(x_{t-1},y_{t-1}), \ldots, (X_0,Y_0)=(x_0,y_0)]\\
&= \frac{\mathbb{P}[(X_t,Y_t)=(x_t,y_t), (X_{t-1},Y_{t-1})=(x_{t-1},y_{t-1}), \ldots, (X_0,Y_0)=(x_0,y_0)]}{\mathbb{P}[(X_{t-1},Y_{t-1})=(x_{t-1},y_{t-1}), \ldots, (X_0,Y_0)=(x_0,y_0)]}\\
&= \frac{\mathbb{P}[X_t=x_t, X_{t-1}=x_{t-1}, \ldots, X_0=x_0]\;\mathbb{P}[Y_t=y_t, Y_{t-1}=y_{t-1}, \ldots, Y_0=y_0]}{\mathbb{P}[X_{t-1}=x_{t-1}, \ldots, X_0=x_0]\;\mathbb{P}[Y_{t-1}=y_{t-1}, \ldots, Y_0=y_0]}\\
&= \mathbb{P}[X_t=x_t \mid X_{t-1}=x_{t-1}, \ldots, X_0=x_0]\;\mathbb{P}[Y_t=y_t \mid Y_{t-1}=y_{t-1}, \ldots, Y_0=y_0].
\end{aligned}$$

Now we use the fact that each is separately a Markov chain to simplify this last expression:

$$\begin{aligned}
&= \mathbb{P}[X_t=x_t \mid X_{t-1}=x_{t-1}]\;\mathbb{P}[Y_t=y_t \mid Y_{t-1}=y_{t-1}]\\
&= \frac{\mathbb{P}[X_t=x_t, X_{t-1}=x_{t-1}]}{\mathbb{P}[X_{t-1}=x_{t-1}]}\,\frac{\mathbb{P}[Y_t=y_t, Y_{t-1}=y_{t-1}]}{\mathbb{P}[Y_{t-1}=y_{t-1}]}\\
&= \frac{\mathbb{P}[X_t=x_t, X_{t-1}=x_{t-1}]\;\mathbb{P}[Y_t=y_t, Y_{t-1}=y_{t-1}]}{\mathbb{P}[X_{t-1}=x_{t-1}]\;\mathbb{P}[Y_{t-1}=y_{t-1}]}\\
&= \frac{\mathbb{P}[(X_t,Y_t)=(x_t,y_t), (X_{t-1},Y_{t-1})=(x_{t-1},y_{t-1})]}{\mathbb{P}[(X_{t-1},Y_{t-1})=(x_{t-1},y_{t-1})]}\\
&= \mathbb{P}[(X_t,Y_t)=(x_t,y_t) \mid (X_{t-1},Y_{t-1})=(x_{t-1},y_{t-1})].
\end{aligned}$$

That proves the claim.

Step 2: Waiting until the marginal processes meet. The idea of the coupling argument is to consider the first time $T$ such that $X_T = Y_T$. Note that $T$ is a random time. But it is a special kind of random time often referred to as a stopping time. That is, the event $\{T = s\}$ only depends on the trajectory of the joint chain $(X_t, Y_t)$ up to time $s$. Specifically,

$$\{T = s\} = \left\{((X_0,Y_0), \ldots, (X_{s-1},Y_{s-1})) \in N_{s-1}^{\neq},\ X_s = Y_s\right\},$$

where

$$N_{s-1}^{\neq} = \left\{((x_0,y_0), \ldots, (x_{s-1},y_{s-1})) \in [S\times S]^s : x_i \neq y_i,\ 0 \le i \le s-1\right\}.$$

Here is a remarkable observation: the distributions of $X_s$ and $Y_s$ are the same after the coupling time $T$.

LEMMA (Distribution after Coupling) For all $s$ and all $i$,

$$\mathbb{P}[X_s = i,\ T \le s] = \mathbb{P}[Y_s = i,\ T \le s].$$

Proof: We sum over the possible values of $T$ and $X_T = Y_T$, and use the multiplication rule:

$$\begin{aligned}
\mathbb{P}[X_s = i,\ T \le s] &= \sum_{j=1}^n\sum_{h=0}^s \mathbb{P}[X_s = i,\ T = h,\ X_h = j]\\
&= \sum_{j=1}^n\sum_{h=0}^s \mathbb{P}[X_s = i \mid T = h,\ X_h = j]\;\mathbb{P}[T = h,\ X_h = j]\\
&= \sum_{j=1}^n\sum_{h=0}^s \mathbb{P}[X_s = i \mid ((X_0,Y_0), \ldots, (X_{h-1},Y_{h-1})) \in N_{h-1}^{\neq},\ (X_h,Y_h) = (j,j)]\\
&\qquad\times \mathbb{P}[((X_0,Y_0), \ldots, (X_{h-1},Y_{h-1})) \in N_{h-1}^{\neq},\ (X_h,Y_h) = (j,j)].
\end{aligned}$$

Using the Markov property for the joint process, in the form stated in Exercise 3.36, we get that this last expression is

$$= \sum_{j=1}^n\sum_{h=0}^s \mathbb{P}[X_s = i \mid (X_h,Y_h) = (j,j)]\times \mathbb{P}[((X_0,Y_0), \ldots, (X_{h-1},Y_{h-1})) \in N_{h-1}^{\neq},\ (X_h,Y_h) = (j,j)].$$

By the independence of the marginal processes and the fact that they have the same transition matrix, this is

$$\begin{aligned}
&= \sum_{j=1}^n\sum_{h=0}^s \mathbb{P}[X_s = i \mid X_h = j]\times \mathbb{P}[((X_0,Y_0), \ldots, (X_{h-1},Y_{h-1})) \in N_{h-1}^{\neq},\ (X_h,Y_h) = (j,j)]\\
&= \sum_{j=1}^n\sum_{h=0}^s \mathbb{P}[Y_s = i \mid Y_h = j]\times \mathbb{P}[((X_0,Y_0), \ldots, (X_{h-1},Y_{h-1})) \in N_{h-1}^{\neq},\ (X_h,Y_h) = (j,j)].
\end{aligned}$$

Arguing backwards gives $\mathbb{P}[Y_s = i,\ T \le s]$ and concludes the proof.

Step 3: Bounding how long it takes for the marginal processes to meet. Since

$$\mathbb{P}[X_t = i] = \mathbb{P}[X_t = i,\ T \le t] + \mathbb{P}[X_t = i,\ T > t],$$

and similarly for $\mathbb{P}[Y_t = i]$, using the previous lemma we get

$$\left|\mathbb{P}[X_t = i] - \mathbb{P}[Y_t = i]\right| = \left|\mathbb{P}[X_t = i,\ T > t] - \mathbb{P}[Y_t = i,\ T > t]\right| \le \mathbb{P}[X_t = i,\ T > t] + \mathbb{P}[Y_t = i,\ T > t] \le 2\,\mathbb{P}[T > t].$$

So it remains to show that $\mathbb{P}[T > t]$ goes to 0 as $t \to +\infty$.

We note that $\mathbb{P}[T > t]$ is non-increasing as a function of $t$. Indeed, for $h > 0$, we have the implication

$$\{T > t+h\} \subseteq \{T > t\},$$

so by the monotonicity of probabilities

$$\mathbb{P}[T > t+h] \le \mathbb{P}[T > t].$$

So it remains to prove the following lemma.

LEMMA (Tail of Coupling Time) There is a $\beta \in (0,1)$ and a positive integer $m$ such that, for all positive integers $k$,

$$\mathbb{P}[T > km] \le \beta^k.$$

Proof: Recall that the state space of the marginal processes is $[n]$, so the joint process lives in $[n]\times[n]$. Since the event $\{(X_m,Y_m) = (1,1)\}$ implies $\{T \le m\}$, to bound $\mathbb{P}[T > m]$ we note that

$$\mathbb{P}[T > m] = 1 - \mathbb{P}[T \le m] \le 1 - \mathbb{P}[(X_m,Y_m) = (1,1)].$$

To bound the probability on the right-hand side, we construct a path of length $m$ in the transition graph of the joint process from any state to $(1,1)$.

For $i \in [n]$, let $P_i = (z_{i,0}, \ldots, z_{i,\ell_i})$ be the shortest path in the transition graph of $(X_t)_{t\ge0}$ from $i$ to 1, where $z_{i,0} = i$ and $z_{i,\ell_i} = 1$. By irreducibility there exists such a path for any $i$. Here $\ell_i$ is the length of $P_i$, and we define

$$\ell = \max_i \ell_i.$$

We make all the paths above the same length by padding them with 1s. That is, we redefine $P_i = (z_{i,0}, \ldots, z_{i,\ell_i}, 1, \ldots, 1)$ such that this path now has length $\ell$. This remains a path in the transition graph of $(X_t)_{t\ge0}$ because the chain is lazy.

Now, for any pair of states $(i,j) \in [n]\times[n]$, consider the path of length $m := 2\ell$

$$Q(i,j) = \big((z_{i,0},j), \ldots, (z_{i,\ell_i},j), (1,j), \ldots, (1,j), (1,z_{j,0}), \ldots, (1,z_{j,\ell_j}), (1,1), \ldots, (1,1)\big).$$

In words, we leave the second component at $j$ while running through path $P_i$ on the first component, then we leave the first component at 1 while running through path $P_j$ on the second component. Path $Q(i,j)$ is a valid path in the transition graph of the joint chain. Again, we are using that the marginal processes are lazy.

Denote by $Q(i,j)[r]$ the $r$-th state in path $Q(i,j)$. Going back to bounding $\mathbb{P}[(X_m,Y_m) = (1,1)]$, we define $\beta$ as follows:

$$\mathbb{P}[(X_m,Y_m) = (1,1) \mid (X_0,Y_0) = (i,j)] \ge \min_{(i,j)}\mathbb{P}[(X_r,Y_r) = Q(i,j)[r],\ r = 0,\ldots,m \mid (X_0,Y_0) = (i,j)] =: 1-\beta \in (0,1).$$

By the law of total probability,

$$\begin{aligned}
\mathbb{P}[(X_m,Y_m) = (1,1)] &= \sum_{(i,j)\in[n]\times[n]}\mathbb{P}[(X_m,Y_m) = (1,1) \mid (X_0,Y_0) = (i,j)]\;\mathbb{P}[(X_0,Y_0) = (i,j)]\\
&= \sum_{(i,j)\in[n]\times[n]}\mathbb{P}[(X_m,Y_m) = (1,1) \mid (X_0,Y_0) = (i,j)]\;\mu_i\pi_j\\
&\ge \sum_{(i,j)\in[n]\times[n]}\mathbb{P}[(X_r,Y_r) = Q(i,j)[r],\ r = 0,\ldots,m \mid (X_0,Y_0) = (i,j)]\;\mu_i\pi_j\\
&\ge (1-\beta)\sum_{(i,j)\in[n]\times[n]}\mu_i\pi_j = 1-\beta.
\end{aligned}$$

So we have

$$\mathbb{P}[T > m] \le \beta.$$

Because

$$\{T > 2m\} \subseteq \{T > m\},$$

it holds by the multiplication rule that

$$\mathbb{P}[T > 2m] = \mathbb{P}[\{T > 2m\}\cap\{T > m\}] = \mathbb{P}[T > 2m \mid T > m]\;\mathbb{P}[T > m] \le \mathbb{P}[T > 2m \mid T > m]\;\beta.$$

Summing over all state pairs at time $m$, we get

$$\mathbb{P}[T > 2m \mid T > m] = \sum_{(i,j)\in[n]\times[n]}\mathbb{P}[T > 2m \mid T > m,\ (X_m,Y_m) = (i,j)]\;\mathbb{P}[(X_m,Y_m) = (i,j) \mid T > m].$$

Arguing as above, note that for $i \neq j$

$$\begin{aligned}
\mathbb{P}[T \le 2m \mid T > m,\ (X_m,Y_m) = (i,j)] &\ge \mathbb{P}[(X_{2m},Y_{2m}) = (1,1) \mid T > m,\ (X_m,Y_m) = (i,j)]\\
&= \mathbb{P}[(X_{2m},Y_{2m}) = (1,1) \mid (X_m,Y_m) = (i,j),\ ((X_0,Y_0), \ldots, (X_{m-1},Y_{m-1})) \in N_{m-1}^{\neq}]\\
&= \mathbb{P}[(X_{2m},Y_{2m}) = (1,1) \mid (X_m,Y_m) = (i,j)]\\
&= \mathbb{P}[(X_m,Y_m) = (1,1) \mid (X_0,Y_0) = (i,j)]\\
&\ge 1-\beta,
\end{aligned}$$

by the Markov property and the time-homogeneity of the process.

Plugging above and noting that $\mathbb{P}[(X_m,Y_m) = (i,j) \mid T > m] = 0$ when $i = j$, we get that

$$\mathbb{P}[T > 2m \mid T > m] \le \sum_{\substack{(i,j)\in[n]\times[n]\\ i \neq j}}\beta\;\mathbb{P}[(X_m,Y_m) = (i,j) \mid T > m] = \beta.$$

So we have proved that $\mathbb{P}[T > 2m] \le \beta^2$. Proceeding similarly by induction gives the claim.
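The coupling argument can also be watched in simulation (our own sketch; unlike the proof, we start the second chain from a fixed state rather than from $\pi$, which does not affect the geometric decay of the tail):

import numpy as np

rng = np.random.default_rng(535)

# an irreducible, lazy chain on 3 states
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

def coupling_time(P, x0, y0, rng, tmax=10000):
    # run two independent copies of the chain until they first meet
    x, y = x0, y0
    for t in range(1, tmax + 1):
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
        if x == y:
            return t
    return tmax

samples = np.array([coupling_time(P, 0, 1, rng) for _ in range(10000)])
print(samples.mean())         # average coupling time
print((samples > 10).mean())  # empirical estimate of P[T > 10]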