\(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\S}{\mathcal{S}}\) \(\newcommand{\indep}{\perp\!\!\!\perp}\) \(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\)

5.2. Background: review of conditional probability

We first review the concept of conditioning, which generally plays a key role in probabilistic modeling and reasoning.

5.2.1. Conditional probability

We start with events. Throughout, we work on a fixed probability space \((\Omega, \mathcal{F}, \P)\), which we assume is discrete, i.e., \(\Omega\) is a countable set.

DEFINITION (Conditional Probability) Let \(A\) and \(B\) be two events with \(\mathbb{P}[B] > 0\). The conditional probability of \(A\) given \(B\) is defined as

\[ \P[A|B] = \frac{\P[A \cap B]}{\P[B]}. \]

\(\natural\)

The intuitive interpretation goes something like this: knowing that event \(B\) has occurred, the updated probability of observing \(A\) is the probability of the part of \(A\) that lies in \(B\), normalized by \(\P[B]\) to reflect that outcomes outside \(B\) now have probability \(0\).
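For concreteness, here is a minimal Python sketch (the die, the events, the seed, and the sample size are choices made purely for illustration) that estimates a conditional probability by simulation: for a fair six-sided die, with \(A\) the event that the roll is a \(2\) and \(B\) the event that the roll is even, the exact value of \(\P[A|B]\) is \(1/3\).

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # fair six-sided die

A = (rolls == 2)        # event A: the roll is a 2
B = (rolls % 2 == 0)    # event B: the roll is even

# P[A|B] = P[A and B] / P[B], estimated by empirical frequencies
print(np.mean(A & B) / np.mean(B))   # close to the exact value 1/3
```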

Conditional probabilities generally behave like unconditional probabilities.

Independence can be characterized in terms of conditional probability. In words, \(A\) and \(B\) are independent if conditioning on one of them having taken place does not change the probability of the other occurring.

LEMMA (Independence and Conditional Probability) Let \(A\) and \(B\) be two events of positive probability. Then \(A\) and \(B\) are independent, which we will denote as \(A \indep B\), if and only if \(\P[A|B] = \P[A]\) and \(\P[B|A] = \P[B]\). \(\flat\)

Proof: If \(A\) and \(B\) are independent, then \(\P[A \cap B] = \P[A]\,\P[B]\), which implies

\[ \P[A|B] = \frac{\P[A \cap B]}{\P[B]} = \frac{\P[A] \P[B]}{\P[B]} = \P[A]. \]

The same argument, with the roles of \(A\) and \(B\) reversed, gives \(\P[B|A] = \P[B]\). In the other direction, if \(\P[A|B] = \P[A]\), then

\[ \P[A] = \P[A|B] = \frac{\P[A \cap B]}{\P[B]} \]

implies \(\P[A \cap B] = \P[A]\,\P[B]\) after rearranging, that is, \(A\) and \(B\) are independent. \(\square\)

The conditional probability is often used in three fundamental ways, which we recall next. Proofs can be found in most probability textbooks.

  • Multiplication Rule: For any collection of events \(A_1,\ldots,A_r\),

\[ \P\left[\cap_{i=1}^r A_i\right] = \prod_{i=1}^r \P\left[A_i \,\middle|\, \cap_{j=1}^{i-1} A_j \right]. \]
  • Law of Total Probability: For any event \(B\) and any partition \(A_1,\ldots,A_r\) of \(\Omega\),

\[ \P[B] = \sum_{i=1}^r \P[B|A_i] \P[A_i]. \]
  • Bayes’ Rule: For any events \(A\) and \(B\) with positive probability,

\[ \P[A|B] = \frac{\P[B|A]\P[A]}{\P[B]}. \]

It is implicit that all formulas above hold provided all conditional probabilities are well-defined.
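For concreteness, the following minimal Python sketch (the values of \(\P[A_i]\) and \(\P[B|A_i]\) are made up purely for illustration) checks the Law of Total Probability and Bayes' Rule on a two-event partition.

```python
import numpy as np

# Made-up numbers: a partition {A_1, A_2} of the sample space
p_A = np.array([0.3, 0.7])           # P[A_1], P[A_2]
p_B_given_A = np.array([0.5, 0.2])   # P[B|A_1], P[B|A_2]

# Law of Total Probability: P[B] = sum_i P[B|A_i] P[A_i]
p_B = np.sum(p_B_given_A * p_A)
print(p_B)                           # 0.29

# Bayes' Rule: P[A_i|B] = P[B|A_i] P[A_i] / P[B]
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B, p_A_given_B.sum())   # posterior over the partition; sums to 1
```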

5.2.2. Conditioning on a random variable

Conditional probabilities extend naturally to random variables. If \(X\) is a discrete random variable, we let \(p_X\) be its probability mass function and \(\S_X\) be its support, that is, the set of values where it has positive probability. Then we can for instance condition on the event \(\{X = x\}\) for any \(x \in \S_X\).

We define next the conditional probability mass function.

DEFINITION (Conditional Probability Mass Function) Let \(X\) and \(Y\) be discrete random variables with joint probability mass function \(p_{X, Y}\) and marginals \(p_X\) and \(p_Y\). The conditional probability mass function of \(X\) given \(Y\) is defined as

\[ p_{X|Y}(x|y) := \P[X=x|Y=y] = \frac{p_{X,Y}(x,y)}{p_Y(y)}, \]

which is defined for all \(x \in \S_X\) and \(y \in \S_Y\). \(\natural\)
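For concreteness, a conditional probability mass function can be computed from a joint one by normalizing each column. The following minimal sketch uses a made-up joint table for \(p_{X,Y}\) (rows indexed by \(x\), columns by \(y\)); the numbers are assumptions for illustration only.

```python
import numpy as np

# Made-up joint pmf p_{X,Y}(x, y): rows indexed by x, columns by y
p_XY = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])

p_Y = p_XY.sum(axis=0)       # marginal p_Y(y), summing over x
p_X_given_Y = p_XY / p_Y     # p_{X|Y}(x|y) = p_{X,Y}(x,y) / p_Y(y), column by column

print(p_X_given_Y)
print(p_X_given_Y.sum(axis=0))   # each column sums to 1
```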

The conditional expectation can then be defined in a natural way as the expectation over the conditional probability mass function.

DEFINITION (Conditional Expectation) Let \(X\) and \(Y\) be discrete random variables where \(X\) takes real values and has a finite mean. The conditional expectation of \(X\) given \(Y = y\) is given by

\[ \E[X|Y=y] = \sum_{x \in \S_X} x\, p_{X|Y}(x|y). \]

\(\natural\)

More generally, for a function \(f\) over the range of \(X\), we can define

\[ \E[f(X)|Y=y] = \sum_{x \in \S_X} f(x)\, p_{X|Y}(x|y). \]

We mention one useful formula: the Law of Total Expectation, the expectation version of the Law of Total Probability. It reads

\[ \E[f(X)] = \sum_{y \in \S_Y} \E[f(X)|Y=y] \,p_Y(y). \]
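Continuing with the same made-up joint table, the sketch below computes \(\E[X|Y=y]\) for each \(y\) and checks the Law of Total Expectation against a direct computation of \(\E[X]\); the support of \(X\) is chosen arbitrarily for illustration.

```python
import numpy as np

x_vals = np.array([1.0, 2.0, 3.0])     # support of X (chosen arbitrarily)
p_XY = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])        # made-up joint pmf, rows = x, columns = y

p_Y = p_XY.sum(axis=0)
p_X = p_XY.sum(axis=1)
p_X_given_Y = p_XY / p_Y

# E[X|Y=y] for each y: sum_x x p_{X|Y}(x|y)
E_X_given_Y = x_vals @ p_X_given_Y

# Law of Total Expectation: E[X] = sum_y E[X|Y=y] p_Y(y)
print(E_X_given_Y @ p_Y)   # via conditioning
print(x_vals @ p_X)        # direct computation; the two agree
```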

5.2.3. Conditional expectation as least-squares estimator

Thinking of \(\E[X|Y=y]\) as a function of \(y\) leads to a fundamental characterization of the conditional expectation.

THEOREM (Conditional Expectation and Least Squares) Let \(X\) and \(Y\) be discrete random variables where \(X\) takes real values and has a finite variance. Then the conditional expectation \(h(y) = \E[X|Y=y]\) minimizes the least squares criterion

\[ \min_{h} \E\left[(X - h(Y))^2\right] \]

where the minimum is over all real-valued functions of \(y\). \(\sharp\)

Proof: Think of \(h(y)\) as a vector \(\mathbf{h} = (h_y)_{y \in \S_Y}\), indexed by \(\S_Y\) (which is countable by assumption), with \(h_y = h(y) \in \mathbb{R}\). Then

\[\begin{align*} \mathcal{L}(\mathbf{h}) &=\E\left[(X - h(Y))^2\right]\\ &= \sum_{x\in \S_X} \sum_{y \in \S_Y} (x - h_y)^2 p_{X,Y}(x,y)\\ &= \sum_{y \in \S_Y} \left[\sum_{x\in \S_X} (x - h_y)^2 p_{X,Y}(x,y)\right]. \end{align*}\]

Expanding the sum in the square brackets (which we denote \(q_y\) and think of as a function of \(h_y\)) gives

\[\begin{align*} q_y(h_y) &:= \sum_{x\in \S_X} (x - h_y)^2 p_{X,Y}(x,y)\\ &= \sum_{x\in \S_X} [x^2 - 2 x h_y + h_y^2] \,p_{X,Y}(x,y)\\ &= \left\{\sum_{x\in \S_X} x^2 p_{X,Y}(x,y)\right\} + \left\{- 2 \sum_{x\in \S_X} x p_{X,Y}(x,y)\right\} h_y + \left\{p_Y(y)\right\} h_y^2. \end{align*}\]

By the Minimizing a Quadratic Function Lemma, provided \(p_Y(y) > 0\), the unique global minimum of \(q_y(h_y)\) is attained at

\[ h_y = - \frac{- 2 \sum_{x\in \S_X} x p_{X,Y}(x,y)}{2 p_Y(y)}. \]

After rearranging, we get

\[ h_y = \sum_{x\in \S_X} x \frac{p_{X,Y}(x,y)}{p_Y(y)} = \sum_{x\in \S_X} x p_{X|Y}(x|y) = \E[X|Y=y] \]

as claimed. \(\square\)
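The theorem can also be checked numerically. In the minimal sketch below (same made-up joint table as before; the perturbation scale and seed are arbitrary), the least-squares criterion is evaluated at \(h(y) = \E[X|Y=y]\) and at randomly perturbed alternatives; the conditional expectation always attains the smallest value.

```python
import numpy as np

rng = np.random.default_rng(1)

x_vals = np.array([1.0, 2.0, 3.0])
p_XY = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])        # made-up joint pmf, rows = x, columns = y
p_Y = p_XY.sum(axis=0)
h_star = x_vals @ (p_XY / p_Y)         # h*(y) = E[X|Y=y]

def criterion(h):
    # E[(X - h(Y))^2] = sum_{x,y} (x - h_y)^2 p_{X,Y}(x,y)
    return np.sum((x_vals[:, None] - h[None, :]) ** 2 * p_XY)

print(criterion(h_star))
for _ in range(5):
    h_pert = h_star + rng.normal(scale=0.5, size=h_star.shape)
    print(criterion(h_pert) >= criterion(h_star))   # always True
```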

5.2.4. Conditional independence

We begin with the definition.

DEFINITION (Conditional Independence) Let \(A, B, C\) be events such that \(\P[C] > 0\). Then \(A\) and \(B\) are conditionally independent given \(C\), denoted \(A \indep B | C\), if

\[ \P[A \cap B| C] = \P[A|C] \,\P[B|C]. \]

\(\natural\)

In words, quoting [Wikipedia]:

\(A\) and \(B\) are conditionally independent given \(C\) if and only if, given knowledge that \(C\) occurs, knowledge of whether \(A\) occurs provides no information on the likelihood of \(B\) occurring, and knowledge of whether \(B\) occurs provides no information on the likelihood of \(A\) occurring.

In general, conditionally independent events are not (unconditionally) independent.

EXAMPLE: Imagine I have two six-sided dice. Die 1 has faces \(\{1,3,5,7,9,11\}\) and die 2 has faces \(\{2, 4, 6, 8, 10, 12\}\). Suppose I perform the following experiment: I pick one of the two dice uniformly at random, and then I roll that die twice. Let \(X_1\) and \(X_2\) be the outcomes of the rolls. Consider the events \(A = \{X_1 = 1\}\), \(B = \{X_2 = 2\}\), and \(C = \{\text{die 1 is picked}\}\). The events \(A\) and \(B\) are clearly dependent: if \(A\) occurs, then I know that die 1 was picked, and hence \(B\) cannot occur. Knowledge of one event provides information about the likelihood of the other event occurring. Formally, by the law of total probability,

\[ \P[A] = \P[A|C]\P[C] + \P[A|C^c]\P[C^c] = \frac{1}{6}\cdot\frac{1}{2} + 0 \cdot\frac{1}{2} = \frac{1}{12}. \]

Similarly \(\P[B] = \frac{1}{12}\). Yet \(\P[A \cap B] = 0 \neq \frac{1}{12} \frac{1}{12}\).

On the other hand, we claim that \(A\) and \(B\) are conditionally independent given \(C\). Again this is intuitively clear: once I pick a die, the two rolls are independent. For a given die choice, knowledge of one roll provides no information about the likelihood of the other roll. Note that the phrase “for a given die choice” is critical in the last statement. Formally, by our experiment, we have \(\P[A|C] = 1/6\), \(\P[B|C] = 0\) and \(\P[A \cap B|C] = 0\). So indeed

\[ \P[A \cap B| C] = \P[A|C] \,\P[B|C] \]

as claimed. \(\lhd\)
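The example lends itself to a short simulation. The sketch below (the seed and number of repetitions are arbitrary choices) estimates the relevant probabilities and confirms both computations: \(A\) and \(B\) are dependent unconditionally, but conditionally independent given \(C\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000                              # number of repetitions of the experiment

die1 = np.array([1, 3, 5, 7, 9, 11])
die2 = np.array([2, 4, 6, 8, 10, 12])

C = rng.integers(0, 2, size=n) == 0      # True when die 1 is picked

def roll():
    # one roll of the chosen die, for every repetition at once
    return np.where(C, die1[rng.integers(0, 6, size=n)],
                       die2[rng.integers(0, 6, size=n)])

X1, X2 = roll(), roll()
A, B = (X1 == 1), (X2 == 2)

# Unconditionally, A and B are dependent:
print(np.mean(A & B), np.mean(A) * np.mean(B))   # 0 versus roughly 1/144

# Given C, they are conditionally independent (both sides are 0 here):
p_C = np.mean(C)
print(np.mean(A & B & C) / p_C,
      (np.mean(A & C) / p_C) * (np.mean(B & C) / p_C))
```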

See the exercises for further instances of the important principle that conditional probabilities satisfy the same rules as probabilities.

Conditional independence is naturally extended to random vectors.

DEFINITION (Conditional Independence of Random Vectors) Let \(\bX, \bY, \bW\) be discrete random vectors. Then \(\bX\) and \(\bY\) are said to be conditionally independent given \(\bW\), denoted \(\bX \indep \bY | \bW\), if for all \(\bx \in \S_\bX\), \(\by \in \S_\bY\) and \(\bw \in \S_\bW\),

\[ \P[\bX = \bx, \bY = \by|\bW = \bw] = \P[\bX = \bx |\bW = \bw] \,\P[\bY = \by|\bW = \bw]. \]

\(\natural\)

An important consequence is that we can drop the conditioning on the independent variable.

LEMMA (Role of Independence) Let \(\bX, \bY, \bW\) be discrete random vectors such that \(\bX \indep \bY | \bW\). For all \(\bx \in \S_\bX\), \(\by \in \S_\bY\) and \(\bw \in \S_\bW\),

\[ \P[\bX = \bx | \bY=\by, \bW=\bw] = \P[\bX = \bx | \bW = \bw]. \]

\(\flat\)

Proof: In a previous exercise, we showed that \(A \indep B | C\) implies \(\P[A | B\cap C] = \P[A | C]\). That implies the claim. \(\square\)
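As a sanity check of the lemma, the sketch below constructs a small joint distribution in which \(\bX \indep \bY \mid \bW\) holds by design (all numbers are made up for illustration) and verifies that conditioning on \(\bY\) can indeed be dropped.

```python
import numpy as np

# Made-up distribution with X independent of Y given W, by construction
p_W = np.array([0.4, 0.6])
p_X_given_W = np.array([[0.2, 0.8],    # p_{X|W}(x|w), rows = w, columns = x
                        [0.5, 0.5]])
p_Y_given_W = np.array([[0.7, 0.3],    # p_{Y|W}(y|w), rows = w, columns = y
                        [0.1, 0.9]])

# p_{X,Y,W}(x, y, w) = p_W(w) p_{X|W}(x|w) p_{Y|W}(y|w), indexed [w, x, y]
p_XYW = p_W[:, None, None] * p_X_given_W[:, :, None] * p_Y_given_W[:, None, :]

# Compare P[X = x | Y = y, W = w] with P[X = x | W = w]
p_YW = p_XYW.sum(axis=1)                    # marginal over x, indexed [w, y]
p_X_given_YW = p_XYW / p_YW[:, None, :]     # indexed [w, x, y]
for y in range(2):
    print(np.allclose(p_X_given_YW[:, :, y], p_X_given_W))   # True for each y
```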