7.8. Online supplementary material#
7.8.1. Quizzes, solutions, code, etc.#
7.8.1.1. Just the code#
An interactive Jupyter notebook featuring the code in this chapter can be accessed below (Google Colab recommended). You are encouraged to tinker with it. Some suggested computational exercises are scattered throughout. The notebook is also available as a slideshow.
7.8.1.2. Self-assessment quizzes#
A more extensive web version of the self-assessment quizzes is available by following the links below.
7.8.1.3. Auto-quizzes#
Automatically generated quizzes for this chapter can be accessed here (Google Colab recommended).
7.8.1.4. Solutions to odd-numbered warm-up exercises#
(with help from Claude, Gemini, and ChatGPT)
E7.2.1 We need to check that all entries of $P$ are nonnegative and that each row of $P$ sums to 1.
Non-negativity: All entries of $P$ are $\geq 0$ by inspection.
Row sums: Each row of $P$ sums to 1 (for instance, the entries of Row 1 add up to 1).
Therefore, $P$ is a stochastic matrix.
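This kind of check is easy to do numerically. A minimal sketch, reusing the chapter's numpy import and a placeholder matrix standing in for the exercise's $P$:
P = np.array([
    [0.5, 0.5],
    [0.3, 0.7]])                                # placeholder transition matrix
print(np.all(P >= 0))                           # non-negativity of all entries
print(np.allclose(P @ np.ones(2), np.ones(2)))  # each row sums to 1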
E7.2.3
E7.2.5
# Build the transition graph: one node per state, one directed edge per positive transition probability.
G = nx.DiGraph()
G.add_nodes_from(range(1, 5))
G.add_edges_from([(1, 2), (1, 4), (2, 1), (2, 3), (3, 2), (3, 4), (4, 1), (4, 3)])
The transition graph has a directed edge from state $i$ to state $j$ whenever the transition probability $p_{i,j}$ is strictly positive.
E7.2.7
E7.2.9 The probability is given by the entry
Therefore, the probability is
E7.2.11 The marginal distribution at time 2 is given by $\mu_2 = \mu_0 P^2$, where $\mu_0$ is the initial distribution viewed as a row vector. Thus, the answer is obtained by computing $\mu_0 P^2$.
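In code, this is one line of matrix arithmetic. A sketch with placeholder values for $\mu_0$ and $P$:
mu0 = np.array([1.0, 0.0])              # placeholder initial distribution
P = np.array([
    [0.5, 0.5],
    [0.3, 0.7]])                        # placeholder transition matrix
mu2 = mu0 @ LA.matrix_power(P, 2)       # marginal at time 2: mu0 P^2
print(mu2)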
E7.2.13 The distribution at time 1 is given by $\mu_1 = \mu_0 P$, where $\mu_0$ is the initial distribution.
E7.2.15 The chain alternates deterministically between states 1 and 2. Therefore, starting in state 1, it will return to state 1 after exactly 2 steps.
E7.3.1 Yes, the matrix is irreducible. The corresponding transition graph is a cycle, which is strongly connected.
E7.3.3 We need to check if $\pi P = \pi$. Carrying out the multiplication $\pi P$ and comparing the result entrywise with $\pi$ settles whether $\pi$ is a stationary distribution.
E7.3.5 Let
The first equation simplifies to
E7.3.7 To verify that $\pi$ is a stationary distribution, we compute the product $\pi P$.
The result is equal to $\pi$, which confirms stationarity.
E7.3.9 Let $\mathbf{1}$ denote the all-ones column vector. Then for every state $i$, $(P\mathbf{1})_i = \sum_j p_{i,j} = 1$
because each row of a stochastic matrix sums to 1. Therefore, $P\mathbf{1} = \mathbf{1}$, i.e., $\mathbf{1}$ is a right eigenvector of $P$ with eigenvalue 1.
E7.3.11 Let $\pi$ denote the stationary distribution, with entries $\pi_i$.
Step 1: Write out the system of equations based on the stationarity condition $\pi P = \pi$.
Step 2: Rearrange the equations to have zero on the right-hand side.
Step 3: Simplify the equations.
Step 4: Use the condition $\sum_i \pi_i = 1$ to express one of the unknowns in terms of the others.
Step 5: Substitute the expression for that unknown into the remaining equations.
Step 6: Simplify the equations further.
Step 7: Solve the linear system of two equations with two unknowns using any suitable method (e.g., substitution, elimination, or matrix inversion).
Using the substitution method: From the second equation, express one unknown in terms of the other.
Substitute this into the first equation and solve.
Now, substitute the resulting value back to recover the other unknown.
Finally, use $\sum_i \pi_i = 1$ to determine the remaining entry.
Therefore, the stationary distribution is the vector $\pi$ obtained in this way.
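Numerically, the same answer can be obtained with the “Replace an Equation” method: the stationarity system $(P^T - I)\pi = 0$ is singular, so one of its equations is replaced by the normalization constraint. A sketch with a placeholder matrix:
P = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4]])          # placeholder 3-state transition matrix
n = P.shape[0]
M = P.T - np.eye(n)            # stationarity: (P^T - I) pi = 0
M[-1, :] = np.ones(n)          # replace the last equation with sum(pi) = 1
b = np.zeros(n)
b[-1] = 1.0
pi = LA.solve(M, b)
print(pi)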
E7.4.1 Given the transition matrix of a Markov chain:
determine if the chain is lazy.
E7.4.3 Given a Markov chain with transition matrix:
and initial distribution
E7.5.1 The degree matrix $D$ is the diagonal matrix whose $i$-th diagonal entry is the degree of vertex $i$.
The degrees are computed by summing the entries in each row of the adjacency matrix $A$.
E7.5.3 A stochastic matrix has rows that sum to 1. The rows of $P = D^{-1}A$ satisfy $\sum_j (D^{-1}A)_{i,j} = \frac{1}{\delta(i)}\sum_j a_{i,j} = \frac{\delta(i)}{\delta(i)} = 1$.
Thus, $P$ is a stochastic matrix.
E7.5.5
E7.5.7 First, compute the transition matrix $P = D^{-1}A$.
Then, compute the modified transition matrix with the damping factor: $Q = \alpha P + (1 - \alpha)\frac{1}{n}\mathbf{1}\mathbf{1}^T$.
E7.5.9 First, compute the transition matrix $P = D^{-1}A$.
Then, compute the modified transition matrix with the damping factor: $Q = \alpha P + (1 - \alpha)\frac{1}{n}\mathbf{1}\mathbf{1}^T$.
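The damping computation, followed by power iteration to extract the PageRank vector, can be sketched as follows (the adjacency matrix is a placeholder, and alpha = 0.85 is a common choice of damping factor, not necessarily the one used in the exercise):
A = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0]])                # placeholder adjacency matrix
n = A.shape[0]
alpha = 0.85                   # damping factor
P = np.diag(1 / (A @ np.ones(n))) @ A                # transition matrix P = D^{-1} A
Q = alpha * P + (1 - alpha) * np.ones((n, n)) / n    # damped transition matrix
mu = np.ones(n) / n            # uniform initial distribution
for _ in range(100):           # power iteration: mu <- mu Q
    mu = mu @ Q
print(mu)                      # approximate PageRank vector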
E7.5.11 The new adjacency matrix is obtained from the original one by adding a self-loop at vertex 4.
E7.5.13 The modified transition matrix
Using the values from
E7.6.1 The acceptance probability is given by the Metropolis-Hastings formula $\min\left\{1, \frac{\pi(y)\,Q(y,x)}{\pi(x)\,Q(x,y)}\right\}$, where $\pi$ is the target distribution and $Q$ the proposal transition matrix.
E7.6.3 Since the proposal chain is symmetric, the acceptance probability simplifies to $\min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}$.
E7.6.5 The acceptance probability is given by the same formula, evaluated at the particular target and proposal distributions of the exercise.
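For concreteness, here is a minimal sketch of one Metropolis-Hastings step on a finite state space with a symmetric (uniform) proposal; the target distribution below is a placeholder:
rng = np.random.default_rng()
pi = np.array([0.2, 0.3, 0.5])         # placeholder target distribution
n = len(pi)
def mh_step(x):
    y = rng.integers(n)                # symmetric proposal: uniform over all states
    accept = min(1.0, pi[y] / pi[x])   # acceptance probability for a symmetric proposal
    return y if rng.uniform() < accept else x
x = 0
for _ in range(10000):                 # run the chain; the samples approach pi
    x = mh_step(x)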
E7.6.7 The conditional probability is given by the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ applied to the unit's total input, evaluated at the values given in the exercise.
E7.6.9 The energy is given by the RBM energy function $E(v, h) = -a^\top v - b^\top h - v^\top W h$, evaluated at the given configuration.
E7.6.11 The conditional mean for the hidden units is $\mathbb{E}[h_j \mid v] = \sigma\big(b_j + \sum_i w_{ij} v_i\big)$.
For each hidden unit, substitute the given weights, biases, and visible values into this formula.
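A short numerical sketch of this computation (all weights, biases, and visible values below are placeholders):
sigmoid = lambda z: 1 / (1 + np.exp(-z))
W = np.array([
    [1.0, -1.0],
    [0.5, 2.0]])                  # placeholder weights w_{ij}
b = np.array([0.0, -0.5])         # placeholder hidden biases
v = np.array([1.0, 0.0])          # placeholder visible configuration
h_mean = sigmoid(b + v @ W)       # conditional means of the hidden units given v
print(h_mean)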
7.8.1.5. Learning outcomes#
Define a discrete-time Markov chain and its state space, initial distribution, and transition probabilities.
Identify Markov chains in real-world scenarios, such as weather patterns or random walks on graphs.
Construct the transition matrix for a time-homogeneous Markov chain.
Construct the transition graph of a Markov chain from its transition matrix.
Prove that the transition matrix of a Markov chain is a stochastic matrix.
Apply the Markov property to calculate the probability of specific sequences of events in a Markov chain using the distribution of a sample path.
Compute the marginal distribution of a Markov chain at a specific time using matrix powers.
Use simulation to generate sample paths of a Markov chain.
Define a stationary distribution of a finite-state, discrete-time, time-homogeneous Markov chain and express the defining condition in matrix form.
Explain the relationship between stationary distributions and left eigenvectors of the transition matrix.
Determine whether a given probability distribution is a stationary distribution for a given Markov chain.
Identify irreducible Markov chains and explain their significance in the context of stationary distributions.
State the main theorem on the existence and uniqueness of a stationary distribution for an irreducible Markov chain.
Compute the stationary distribution of a Markov chain numerically using either eigenvalue methods or by solving a linear system of equations, including the “Replace an Equation” method.
Define the concepts of aperiodicity and weak laziness for finite-space discrete-time Markov chains.
State the Convergence to Equilibrium Theorem and the Ergodic Theorem for irreducible Markov chains.
Verify the Ergodic Theorem by simulating a Markov chain and comparing the frequency of visits to each state with the stationary distribution.
Explain the concept of coupling and its role in proving the Convergence to Equilibrium Theorem for irreducible, weakly lazy Markov chains.
Define random walks on directed and undirected graphs, and express their transition matrices in terms of the adjacency matrix.
Determine the stationary distribution of a random walk on an undirected graph and explain its relation to degree centrality.
Explain the concept of reversibility and its connection to the stationary distribution of a random walk.
Describe the PageRank algorithm and its interpretation as a modified random walk on a directed graph.
Compute the PageRank vector by finding the stationary distribution of the modified random walk using power iteration.
Apply the PageRank algorithm to real-world datasets to identify central nodes in a network.
Explain the concept of Personalized PageRank and how it differs from the standard PageRank algorithm.
Modify the PageRank algorithm to implement Personalized PageRank and interpret the results based on the chosen teleportation distribution.
Describe the concept of Markov Chain Monte Carlo (MCMC) and its application in sampling from complex probability distributions.
Explain the Metropolis-Hastings algorithm, including the proposal distribution and acceptance-rejection steps.
Calculate acceptance probabilities in the Metropolis-Hastings algorithm for a given target and proposal distribution.
Prove the correctness of the Metropolis-Hastings algorithm by showing that the resulting Markov chain is irreducible and reversible with respect to the target distribution.
Implement the Gibbs sampling algorithm for a given probabilistic model, such as a Restricted Boltzmann Machine (RBM).
Analyze the conditional independence properties of RBMs and their role in Gibbs sampling.
7.8.2. Additional sections#
7.8.2.1. Random walk on a weighted graph#
The previous definitions extend naturally to the weighted case. Again we allow loops, i.e., self-weights $w_{i,i} > 0$. The degree of a vertex $i$ is then $\delta(i) = \sum_j w_{i,j}$,
which includes the self-weight $w_{i,i}$.
DEFINITION (Random Walk on a Weighted Graph) Let $G = (V, E, w)$ be a weighted graph. Random walk on $G$ is the Markov chain on $V$ that, from vertex $i$, moves to vertex $j$ with probability $p_{i,j} = w_{i,j}/\delta(i)$.
Once again, it is easily seen that the transition matrix of random walk on $G$ is $P = D^{-1}A$, where $A$ is the (weighted) adjacency matrix and $D$ the degree matrix.
EXAMPLE: (A Weighted Graph) Here is another example. Consider the following adjacency matrix on $5$ vertices:
A_ex = np.array([
[0, 8, 0, 1, 0],
[8, 0, 6, 1, 0],
[0, 6, 0, 0, 7],
[1, 1, 0, 0, 10],
[0, 0, 7, 10, 0]])
It is indeed a symmetric matrix.
print(LA.norm(A_ex.T - A_ex))
0.0
We define a graph from its adjacency matrix.
n_ex = A_ex.shape[0] # number of vertices
G_ex = nx.from_numpy_array(A_ex) # graph
To draw it, we first define edge labels by creating a dictionary that assigns to each edge (as a tuple) its weight. Here G.edges.data('weight') (see G.edges) iterates through the edges (u,v) and includes their weight as the third entry of the tuple (u,v,w). Then we use the function networkx.draw_networkx_edge_labels() to add the weights as edge labels.
edge_labels = {}
for (u, v, w) in G_ex.edges.data('weight'):
    edge_labels[(u, v)] = w
pos = nx.circular_layout(G_ex)
nx.draw_networkx(G_ex, pos, labels={i: i+1 for i in range(n_ex)},
                 node_color='black', font_color='white')
edge_labels = nx.draw_networkx_edge_labels(G_ex, pos, edge_labels=edge_labels)
plt.axis('off')
plt.show()

The transition matrix of the random walk on this graph can be computed using the lemma above. We first compute the degree matrix, then apply the formula.
degrees_ex = A_ex @ np.ones(n_ex)     # weighted degrees: row sums of A
inv_degrees_ex = 1 / degrees_ex
Dinv_ex = np.diag(inv_degrees_ex)     # D^{-1}
P_ex = Dinv_ex @ A_ex                 # transition matrix P = D^{-1} A
print(P_ex)
[[0. 0.88888889 0. 0.11111111 0. ]
[0.53333333 0. 0.4 0.06666667 0. ]
[0. 0.46153846 0. 0. 0.53846154]
[0.08333333 0.08333333 0. 0. 0.83333333]
[0. 0. 0.41176471 0.58823529 0. ]]
This is indeed a stochastic matrix.
print(P_ex @ np.ones(n_ex))
[1. 1. 1. 1. 1.]
LEMMA (Irreducibility in Undirected Case) Let $G$ be a weighted undirected graph. Random walk on $G$ is irreducible if and only if $G$ is connected.
THEOREM (Stationary Distribution on a Graph) Let $G$ be a connected weighted undirected graph. Then the unique stationary distribution of random walk on $G$ is given by $\pi_i = \delta(i) / \sum_j \delta(j)$ for each vertex $i$.
EXAMPLE: (A Weighted Graph, continued) Going back to our weighted graph example, we use the previous theorem to compute the stationary distribution. Note that the graph is indeed connected so the stationary distribution is unique. We have already computed the degrees.
print(degrees_ex)
[ 9. 15. 13. 12. 17.]
We compute the total degree $\sum_j \delta(j)$.
total_degrees_ex = degrees_ex @ np.ones(n_ex)
print(total_degrees_ex)
66.0
Finally, the stationary distribution is:
pi_ex = degrees_ex / total_degrees_ex
print(pi_ex)
[0.13636364 0.22727273 0.1969697 0.18181818 0.25757576]
We check stationarity.
print(LA.norm(P_ex.T @ pi_ex - pi_ex))
2.7755575615628914e-17
A random walk on a weighted undirected graph is reversible. Conversely, it turns out that any reversible chain can be viewed as a random walk on an appropriately defined weighted undirected graph. See the exercises.
7.8.2.2. Spectral techniques for random walks on graphs#
In this section, we use techniques from spectral graph theory to analyze random walks on graphs.
Applying the spectral theorem via the normalized Laplacian We have seen how to compute the unique stationary distribution of random walk on a connected weighted undirected graph. Spectral techniques based on the normalized Laplacian give finer, quantitative information about the walk's long-run behavior.
Recall that the normalized Laplacian of a weighted graph $G$ with adjacency matrix $A$ and degree matrix $D$ is $L = I - D^{-1/2} A D^{-1/2}$.
Recall that in the weighted case, the degree is defined as $\delta(i) = \sum_j w_{i,j}$,
where the $w_{i,j}$ are the edge weights.
Moreover, $D^{1/2} P D^{-1/2} = D^{-1/2} A D^{-1/2} = I - L$,
which makes $I - L$ a symmetric matrix similar to the transition matrix $P = D^{-1}A$.
We return to the eigenvectors of $L$. Since $L$ is symmetric, it admits a spectral decomposition $L = \sum_{i=1}^n \eta_i z_i z_i^T$,
where the $z_i$ form an orthonormal basis of eigenvectors with eigenvalues $0 = \eta_1 \leq \cdots \leq \eta_n$.
THEOREM (Eigendecomposition of Random Walk on a Graph) Let $G$ be a connected weighted graph with transition matrix $P = D^{-1}A$, and let $L = Z H Z^T$ be a spectral decomposition of its normalized Laplacian, with $Z$ orthogonal and $H = \mathrm{diag}(\eta_1, \ldots, \eta_n)$. Then $P = Q (I - H) Q^{-1}$,
where the columns of $Q = D^{-1/2} Z$ are right eigenvectors of $P$,
where the corresponding eigenvalues are $\lambda_i = 1 - \eta_i$.
Proof: We write
Rearranging this becomes
Hence for all
Because
We also note that
by the orthonormality of the eigenvectors of the normalized Laplacian. So the columns of
That gives the first claim.
To get the second claim, we first note that (for similar calculations, see the definition of an SVD)
where
We then use the expression for
while
So
That proves the second claim.
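As a numerical sanity check of the theorem, one can verify on the weighted graph from the previous section that the eigenvalues of P_ex are exactly $1 - \eta_i$ (this sketch reuses A_ex, P_ex, degrees_ex, and n_ex from above):
Dhalf_inv = np.diag(1 / np.sqrt(degrees_ex))
L_ex = np.eye(n_ex) - Dhalf_inv @ A_ex @ Dhalf_inv   # normalized Laplacian
eta = LA.eigvalsh(L_ex)                              # Laplacian eigenvalues, ascending
lam = np.sort(LA.eigvals(P_ex).real)                 # eigenvalues of P, ascending
print(np.allclose(np.sort(1 - eta), lam))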
If
Limit theorems revisited The Convergence to Equilibrium Theorem implies that, in the irreducible, aperiodic case, the distribution of the chain at time $t$ converges to the stationary distribution as $t \to +\infty$. In the reversible case, spectral methods make the rate of convergence quantitative.
DEFINITION (Spectral Gap) Let $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq -1$ be the eigenvalues of the transition matrix $P$ of random walk on a connected weighted graph. The absolute spectral gap is $\gamma_* = 1 - \max\{|\lambda_2|, |\lambda_n|\}$.
THEOREM (Convergence to Equilibrium: Reversible Case) Let
as
where
Proof: Similarly to the Power Iteration Lemma, using the Eigendecomposition of Random Walk on a Graph we get
By induction,
by calculations similar to the proof of the Eigendecomposition of Random Walk on a Graph.
In the irreducible, lazy case, for
Moreover,
We then use Cauchy-Schwarz and the fact that
We get that the above is
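The geometric decay asserted by the theorem can be observed numerically on the running example (a sketch reusing P_ex, pi_ex, and n_ex; the point-mass initial distribution is an arbitrary choice):
mu = np.zeros(n_ex)
mu[0] = 1.0                        # start deterministically at vertex 1
for t in range(1, 16):
    mu = mu @ P_ex                 # distribution at time t
    print(t, LA.norm(mu - pi_ex))  # distance to stationarity shrinks geometrically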
We record an immediate corollary that will be useful next. Let
It will be convenient to use the following notation.
THEOREM For any initial distribution
Proof: By the Time Marginals Theorem,
Because
by the triangle inequality.
Now by the theorem this is
That proves the claim.
We also prove a version of the Ergodic Theorem.
THEOREM (Ergodic Theorem: Reversible Case) Let
in probability as
Proof idea: We use Chebyshev’s Inequality. By the Convergence to Equilibrium Theorem (Reversible Case), the expectation converges to the limit. To bound the variance, we use the Eigendecomposition of Random Walk on a Graph.
Proof: We use Chebyshev’s Inequality, similarly to our proof of the Weak Law of Large Numbers. However, unlike that case, here the terms in the sum are not independent and require some finessing. Define again the (column) vector
By the corollary, the expectation of the sum can be bounded as follows
as
Next we bound the variance of the sum. By the Variance of a Sum,
We bound the variance and covariance separately using the Eigendecomposition of Random Walk on a Graph.
To obtain convergence, a trivial bound on the variance suffices. Then
Hence,
as
Bounding the covariance requires a more delicate argument. Fix
We now use the time-homogeneity of the chain to note that
We now use the corollary. This expression in absolute value is
Plugging back above,
where we used that
Returning to the sum over the covariances, the previous bound gives
To evaluate the sum we make the change of variable
as
We have shown that
For any
We can now apply Chebyshev to get
as
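The Ergodic Theorem itself is easy to check by simulation: the empirical frequency of visits to each state should approach the stationary distribution. A sketch on the running example (reusing P_ex, pi_ex, and n_ex; the seed and chain length are arbitrary):
rng = np.random.default_rng(535)
T, x = 100000, 0                         # chain length; start at vertex 1
visits = np.zeros(n_ex)
for _ in range(T):
    x = rng.choice(n_ex, p=P_ex[x, :])   # one step of the walk
    visits[x] += 1
print(visits / T)                        # empirical frequencies of visits
print(pi_ex)                             # stationary distribution, for comparison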
7.8.3. Additional proofs#
Proof of the convergence theorem We prove the convergence to equilibrium theorem in the irreducible, lazy case. Throughout this section,
We give a probabilistic proof of the Convergence to Equilibrium Theorem. The proof uses a clever idea: coupling. Separately from the chain $(X_t)$, started at an arbitrary initial distribution, we run an independent chain $(Y_t)$ with the same transition matrix, started at the stationary distribution.
for all
as
Step 1: Showing the joint process is Markov. We observe first that the joint process $(X_t, Y_t)_{t \geq 0}$ is itself a Markov chain.
Now we use the fact that each is separately a Markov chain to simplify this last expression
That proves the claim.
Step 2: Waiting until the marginal processes meet. The idea of the coupling argument is to consider the first time $T$ at which the two marginal processes meet, that is, the coupling time $T = \min\{t \geq 0 : X_t = Y_t\}$.
Here is a remarkable observation. The distributions of $X_t$ and $Y_t$ agree on the event that coupling has occurred by time $t$, as the following lemma makes precise.
LEMMA (Distribution after Coupling) For all $t \geq 0$ and all states $x$, $\mathbb{P}[X_t = x, T \leq t] = \mathbb{P}[Y_t = x, T \leq t]$.
Proof: We sum over the possible values of
Using the Markov property for the joint process, in the form stated in Exercise 3.36, we get that this last expression is
By the independence of the marginal processes and the fact that they have the same transition matrix, this is
Arguing backwards gives
Step 3: Bounding how long it takes for the marginal processes to meet. Since
and similarly for
So it remains to show that
We note that
so by the monotonicity of probabilities
So it remains to prove the following lemma.
LEMMA (Tail of Coupling Time) There is a
Proof: Recall that the state space of the marginal processes is
To bound the probability on the right-hand side, we construct a path of length
For
We make all the paths above the same length
Now, for any pair of states
In words, we leave the second component at
Denote by
By the law of total probability,
So we have
Because
it holds by the multiplication rule that
Summing over all state pairs at time
Arguing as above, note that for
by the Markov property and the time-homogeneity of the process.
Plugging this in above and noting that
So we have proved that
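The coupling construction is also easy to simulate. In the sketch below, two independent copies of the walk on the running example are run until they meet (P_ex, pi_ex, and n_ex are reused from above, and the seed is arbitrary):
rng = np.random.default_rng(0)
x, y = 0, rng.choice(n_ex, p=pi_ex)    # X_0 fixed at vertex 1, Y_0 drawn from pi
t = 0
while x != y:                          # run both chains independently until they meet
    x = rng.choice(n_ex, p=P_ex[x, :])
    y = rng.choice(n_ex, p=P_ex[y, :])
    t += 1
print(t)                               # coupling time T for this run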