
\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

8.3. Building blocks of AI 1: backpropagation

We develop the basic mathematical foundations of automatic differentiation. We restrict ourselves to a special setting: multi-layer progressive functions. Many important classifiers take the form of a sequence of compositions where parameters are specific to each layer of composition. We show how to systematically apply the Chain Rule to such functions. We also give a few examples.

8.3.1. Forward v. backward

We begin with a fixed-parameter example to illustrate the issues. Suppose \(f : \mathbb{R}^d \to \mathbb{R}\) can be expressed as a composition of \(L+1\) vector-valued functions \(\bfg_i : \mathbb{R}^{n_i} \to \mathbb{R}^{n_{i+1}}\) and a real-valued function \(\ell : \mathbb{R}^{n_{L+1}} \to \mathbb{R}\) as follows

\[ f(\mathbf{x}) = \ell \circ \bfg_{L} \circ \bfg_{L-1} \circ \cdots \circ \bfg_1 \circ \bfg_0(\mathbf{x}) = \ell(\bfg_{L}(\bfg_{L-1}(\cdots \bfg_1(\bfg_0(\mathbf{x}))\cdots))). \]

Here \(n_0 = d\) is the input dimension. We also let \(n_{L+1} = K\) be the output dimension. Think of

\[ h(\mathbf{x}) = \bfg_{L}(\bfg_{L-1}(\cdots \bfg_1(\bfg_0(\mathbf{x}))\cdots)) \]

as a prediction function (i.e., a regression or classification function) and think of \(\ell\) as a loss function.

Observe first that the function \(f\) itself is straightforward to compute recursively starting from the inside as follows:

\[\begin{align*} \mathbf{z}_0 &:= \mathbf{x}\\ \mathbf{z}_1 &:= \bfg_0(\mathbf{z}_0)\\ \mathbf{z}_2 &:= \bfg_1(\mathbf{z}_1)\\ \vdots\\ \mathbf{z}_L &:= \bfg_{L-1}(\mathbf{z}_{L-1})\\ \hat{\mathbf{y}} := \mathbf{z}_{L+1} &:= \bfg_{L}(\mathbf{z}_{L})\\ f(\mathbf{x}) &:= \ell(\hat{\mathbf{y}}). \end{align*}\]

Anticipating the setting of neural networks, our main application of interest, we refer to \(\mathbf{z}_0 = \mathbf{x}\) as the “input layer”, \(\hat{\mathbf{y}} = \mathbf{z}_{L+1} = \bfg_{L}(\mathbf{z}_{L})\) as the “output layer”, and \(\mathbf{z}_{1} = \bfg_0(\mathbf{z}_0), \ldots, \mathbf{z}_L = \bfg_{L-1}(\mathbf{z}_{L-1})\) as the “hidden layers”. In particular, \(L\) is the number of hidden layers.
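Before turning to the running example below, here is a minimal sketch of this recursive evaluation as a loop over a list of layer functions. The toy layers and loss are placeholders of our own choosing, not from the text.

import torch

def evaluate(x, gs, ell):
    # z_{i+1} = g_i(z_i) for i = 0, ..., L, then f(x) = ell(z_{L+1})
    z = x
    for g in gs:
        z = g(z)
    return ell(z)

gs = [lambda z: 2 * z, lambda z: z + 1.0]      # two toy "layers"
ell = lambda y_hat: 0.5 * (y_hat ** 2).sum()   # toy loss
print(evaluate(torch.tensor([1.0, -1.0]), gs, ell))   # tensor(5.)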

EXAMPLE: We will use the following running example throughout this subsection. We assume that each \(\bfg_i\) is a linear map, that is, \(\bfg_i(\mathbf{z}_i) = \mathcal{W}_{i} \mathbf{z}_i\) where \(\mathcal{W}_{i} \in \mathbb{R}^{n_{i+1} \times n_i}\) is a fixed, known matrix. Assume also that \(\ell : \mathbb{R}^K \to \mathbb{R}\) is defined as

\[ \ell(\hat{\mathbf{y}}) = \frac{1}{2} \|\mathbf{y} - \hat{\mathbf{y}}\|^2, \]

for a fixed, known vector \(\mathbf{y} \in \mathbb{R}^{K}\).

Computing \(f\) recursively starting from the inside as above gives here

\[\begin{align*} \mathbf{z}_0 &:= \mathbf{x}\\ \mathbf{z}_1 &:= \mathcal{W}_{0} \mathbf{z}_0 = \mathcal{W}_{0} \mathbf{x}\\ \mathbf{z}_2 &:= \mathcal{W}_{1} \mathbf{z}_1 = \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\\ \vdots\\ \mathbf{z}_L &:= \mathcal{W}_{L-1} \mathbf{z}_{L-1} = \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\\ \hat{\mathbf{y}} := \mathbf{z}_{L+1} &:= \mathcal{W}_{L} \mathbf{z}_{L} = \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\\ f(\mathbf{x}) &:= \ell(\hat{\mathbf{y}}) = \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{2}\left\|\mathbf{y} - \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\right\|^2. \end{align*}\]

In essence, we are comparing an observed outcome \(\mathbf{y}\) to a prediction \(\mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\) based on input \(\mathbf{x}\).

In this section, we look into computing the gradient with respect to \(\mathbf{x}\). (In reality, we will be more interested in taking the gradient with respect to the parameters, i.e., the entries of the matrices \(\mathcal{W}_{0}, \ldots, \mathcal{W}_{L}\), a task to which we will come back later in this section. We will also be interested in more complex – in particular, non-linear – prediction functions.) \(\lhd\)

NUMERICAL CORNER: To make things more concrete, we consider a specific example. We will use torch.linalg.vector_norm to compute the Euclidean norm in PyTorch. Suppose \(d=3\), \(L=1\), \(n_1 = 2\), and \(K = 2\) with the following choices:

import torch

x = torch.tensor([1.,0.,-1.], requires_grad=True)
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]])
W1 = torch.tensor([[-1.,0.],[2.,-1.]])

z0 = x
z1 = W0 @ z0
z2 = W1 @ z1
f = 0.5 * (torch.linalg.vector_norm(y-z2) ** 2)

print(z0)
tensor([ 1.,  0., -1.], requires_grad=True)
print(z1)
tensor([1., 1.], grad_fn=<MvBackward0>)
print(z2)
tensor([-1.,  1.], grad_fn=<MvBackward0>)
print(f)
tensor(0.5000, grad_fn=<MulBackward0>)

\(\unlhd\)

Forward mode \(\idx{forward mode}\xdi\) We are ready to apply the Chain Rule

\[ \nabla f(\mathbf{x})^T = J_{f}(\mathbf{x}) = J_{\ell}(\mathbf{z}_{L+1}) J_{\bfg_L}(\mathbf{z}_L) J_{\bfg_{L-1}}(\mathbf{z}_{L-1}) \cdots J_{\bfg_1}(\mathbf{z}_1) J_{\bfg_0}(\mathbf{x}) \]

where the \(\mathbf{z}_{i}\)s are as above and we used that \(\mathbf{z}_0 = \mathbf{x}\). The matrix product here is well-defined. Indeed, the size of \(J_{g_i}(\mathbf{z}_i)\) is \(n_{i+1} \times n_{i}\) (i.e., number of outputs by number of inputs) while the size of \(J_{g_{i-1}}(\mathbf{z}_{i-1})\) is \(n_{i} \times n_{i-1}\) – so the dimensions are compatible.

So it is straightforward to compute \(\nabla f(\mathbf{x})^T\) recursively as we did for \(f\) itself. In fact, we can compute both simultaneously. This is called the forward mode:

\[\begin{align*} \mathbf{z}_0 &:= \mathbf{x}\\ \mathbf{z}_1 &:= \bfg_0(\mathbf{z}_0), \quad F_0 := J_{\bfg_0}(\mathbf{z}_0)\\ \mathbf{z}_2 &:= \bfg_1(\mathbf{z}_1), \quad F_1 := J_{\bfg_1}(\mathbf{z}_1)\, F_0\\ \vdots\\ \mathbf{z}_L &:= \bfg_{L-1}(\mathbf{z}_{L-1}), \quad F_{L-1} := J_{\bfg_{L-1}}(\mathbf{z}_{L-1})\, F_{L-2}\\ \hat{\mathbf{y}} := \mathbf{z}_{L+1} &:= \bfg_{L}(\mathbf{z}_{L}), \quad F_{L} := J_{\bfg_{L}}(\mathbf{z}_{L})\, F_{L-1}\\ f(\mathbf{x}) &:= \ell(\hat{\mathbf{y}}), \quad \nabla f(\mathbf{x})^T := J_{\ell}(\hat{\mathbf{y}}) F_L. \end{align*}\]

EXAMPLE: (continued) We apply this procedure to the running example. The Jacobian of the linear map \(\bfg_i(\mathbf{z}_i) = \mathcal{W}_{i} \mathbf{z}_i\) is the matrix \(\mathcal{W}_{i}\), as we have seen in a previous example. That is, \(J_{\bfg_i}(\mathbf{z}_i) = \mathcal{W}_{i}\) for any \(\mathbf{z}_i\). To compute the Jacobian of \(\ell\), we rewrite it as a quadratic function

\[\begin{align*} \ell(\hat{\mathbf{y}}) &= \frac{1}{2} \|\mathbf{y} - \hat{\mathbf{y}}\|^2\\ &= \frac{1}{2} \mathbf{y}^T\mathbf{y} - \frac{1}{2} 2 \mathbf{y}^T\hat{\mathbf{y}} + \frac{1}{2}\hat{\mathbf{y}}^T \hat{\mathbf{y}}\\ &= \frac{1}{2} \hat{\mathbf{y}}^T I_{n_{L+1} \times n_{L+1}}\hat{\mathbf{y}} + (-\mathbf{y})^T\hat{\mathbf{y}} + \frac{1}{2} \mathbf{y}^T\mathbf{y}. \end{align*}\]

From a previous example,

\[ J_\ell(\hat{\mathbf{y}})^T = \nabla \ell(\hat{\mathbf{y}}) = \frac{1}{2}\left[I_{n_{L+1} \times n_{L+1}} + I_{n_{L+1} \times n_{L+1}}^T\right]\, \hat{\mathbf{y}} + (-\mathbf{y}) = \hat{\mathbf{y}} - \mathbf{y}. \]

Putting it all together, we get

\[\begin{align*} F_0 &:= J_{\bfg_0}(\mathbf{z}_0) = \mathcal{W}_{0}\\ F_1 &:= J_{\bfg_1}(\mathbf{z}_1)\, F_0 = \mathcal{W}_{1} F_0 = \mathcal{W}_{1} \mathcal{W}_{0}\\ \vdots\\ F_{L-1} &:= J_{\bfg_{L-1}}(\mathbf{z}_{L-1})\, F_{L-2} = \mathcal{W}_{L-1} F_{L-2}= \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0}\\ F_{L} &:= J_{\bfg_{L}}(\mathbf{z}_{L})\, F_{L-1} = \mathcal{W}_{L} F_{L-1} = \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0}\\ \nabla f(\mathbf{x})^T &:= J_{\ell}(\hat{\mathbf{y}}) F_L = (\hat{\mathbf{y}} - \mathbf{y})^T F_L = (\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0}\\ &= (\mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x} - \mathbf{y})^T \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0}. \end{align*}\]

\(\lhd\)

NUMERICAL CORNER: We return to our concrete example. Using .T to convert a column vector into a row vector throws an error in PyTorch, as it is meant to be used only on 2D tensors. Instead, one can use torch.unsqueeze. Below, (z2 - y).unsqueeze(0) adds a dimension to z2 - y, making it a 2D tensor with shape \((1, 2)\).

with torch.no_grad():
    F0 = W0
    F1 = W1 @ F0
    grad_f = (z2 - y).unsqueeze(0) @ F1
    
print(F0)
tensor([[ 0.,  1., -1.],
        [ 2.,  0.,  1.]])
print(F1)
tensor([[ 0., -1.,  1.],
        [-2.,  2., -3.]])
print(grad_f)
tensor([[ 0.,  1., -1.]])

We can check that we get the same outcome using AD.

f.backward()
print(x.grad)
tensor([ 0.,  1., -1.])

\(\unlhd\)
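For the running linear example, the forward mode itself can be written as a short loop over the weight matrices. The helper below is a sketch of our own (its name and structure are assumptions, not from the text), reusing the same numbers as above.

import torch

def forward_mode_grad(x, y, Ws):
    z, F = x, None
    for W in Ws:
        F = W if F is None else W @ F   # F_i = J_{g_i}(z_i) F_{i-1} = W_i F_{i-1}
        z = W @ z                       # z_{i+1} = W_i z_i
    return (z - y) @ F                  # grad f(x)^T = (y_hat - y)^T F_L

Ws = [torch.tensor([[0., 1., -1.], [2., 0., 1.]]),
      torch.tensor([[-1., 0.], [2., -1.]])]
print(forward_mode_grad(torch.tensor([1., 0., -1.]), torch.tensor([0., 1.]), Ws))
# tensor([ 0.,  1., -1.])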

KNOWLEDGE CHECK: Obtain that last expression directly by taking the gradient of

\[ f(\mathbf{x}) = \frac{1}{2}\|\mathbf{y} - \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\|^2. \]

\(\checkmark\)

Reverse mode \(\idx{reverse mode}\xdi\) What we just described corresponds to performing the matrix products in the Chain Rule formula

\[ \nabla f(\mathbf{x})^T = J_{f}(\mathbf{x}) = J_{\ell}(\hat{\mathbf{y}}) J_{\bfg_L}(\mathbf{z}_L) J_{\bfg_{L-1}}(\mathbf{z}_{L-1}) \cdots J_{\bfg_1}(\mathbf{z}_1) J_{\bfg_0}(\mathbf{x}). \]

from right to left. But there is another way to do this – from left to right! This is called reverse mode. First we compute \(f\) recursively as we did previously, in particular we compute the \(\mathbf{z}_i\)s, then going backwards we compute

\[\begin{align*} G_{L+1} &:= J_{\ell}(\hat{\mathbf{y}})\\ G_{L} &:= G_{L+1}\,J_{g_{L}}(\mathbf{z}_{L})\\ &\vdots\\ G_1 &:= G_2 \, J_{\bfg_1}(\mathbf{z}_1)\\ \nabla f(\mathbf{x})^T &:= G_1 \, J_{\bfg_0}(\mathbf{z}_0) \end{align*}\]

This may seem like a rather minor modification to our previous algorithm. It is not – it is much faster! To see this, we need to count the number of elementary operations involved in implementing each mode.

First a simple calculation: How many elementary operations (i.e., additions and multiplications) does it take to multiply an \(n \times m\) matrix \(A = (a_{ij})_{i,j}\) and an \(m \times \ell\) matrix \(B= (b_{ij})_{i,j}\)? Let’s ask ChatGPT. Here’s the answer:

“To calculate the number of elementary operations required to multiply an \(n \times m\) matrix by an \(m \times \ell\) matrix, consider the computation involved for each element of the resulting \(n \times \ell\) matrix.

Each element of the resulting matrix is the dot product of a row from the first matrix and a column from the second matrix. If the first matrix is denoted as \(A\) and the second matrix as \(B\), then the element at position \(i, j\) in the resulting matrix \(C\) is calculated as:

\[ c_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j} + \ldots + a_{im}b_{mj} \]

This computation involves \(m\) multiplications and \(m - 1\) additions, making a total of \(m + (m - 1) = 2m - 1\) elementary operations for each element.

Since matrix \(C\) has \(n \times \ell\) elements, the total number of elementary operations required for the entire matrix \(C\) is:

\[ (n \ell)(2m - 1) = 2nm\ell - n\ell \]

Thus, multiplying an \(n \times m\) matrix by an \(m \times \ell\) matrix takes \(2nm\ell - n\ell\) elementary operations.”

Returning to computing the gradient, we count the number of elementary operations needed for each mode. To simplify the calculation, we assume that \(n_0 = d\) and \(n_1 = n_2 = \cdots = n_{L} = m\) and \(n_{L+1} = K\).

Forward: The matrix \(F_0 = J_{\bfg_0}(\mathbf{z}_0)\) has dimensions \(m \times d\). The matrix \(F_1\), as a product of \(J_{\bfg_1}(\mathbf{z}_1) \in \mathbb{R}^{m \times m}\) and \(F_0 \in \mathbb{R}^{m \times d}\) has dimensions \(m \times d\); it therefore takes \(m (2m-1) d\) operations to compute. The same holds for \(F_2, \ldots, F_{L-1}\) (check it!). By similar considerations, the matrix \(F_L\) has dimensions \(K \times d\) and takes \(K (2m-1) d\) operations to compute. Finally, \(\nabla f(\mathbf{x})^T = J_{\ell}(\mathbf{z}_{L+1}) F_L \in \mathbb{R}^{1 \times d}\) and takes \((2K-1) d\) operations to compute. Overall the number of operations is

\[ (L-1) m (2m-1) d + K (2m-1) d + (2K-1) d. \]

This is approximately \(2 L m^2 d\) if we think of \(K\) as a small constant and ignore the smaller order terms.

Reverse: The matrix \(G_{L+1} = J_{\ell}(\mathbf{z}_{L+1})\) has dimensions \(1 \times K\). The matrix \(G_{L}\), as a product of \(G_{L+1} \in \mathbb{R}^{1 \times K}\) and \(J_{g_{L}}(\mathbf{z}_{L}) \in \mathbb{R}^{K \times m}\) has dimensions \(1 \times m\); it therefore takes \((2K-1) m\) operations to compute. The matrix \(G_{L-1}\), as a product of \(G_{L} \in \mathbb{R}^{1 \times m}\) and \(J_{g_{L-1}}(\mathbf{z}_{L-1}) \in \mathbb{R}^{m \times m}\) has dimensions \(1 \times m\); it therefore takes \((2m-1) m\) operations to compute. The same holds for \(G_{L-2}, \ldots, G_{1}\) (check it!). By similar considerations, \(\nabla f(\mathbf{x})^T = G_1 \, J_{\bfg_0}(\mathbf{z}_0) \in \mathbb{R}^{1 \times d}\) and takes \((2m-1) d\) operations to compute. Overall the number of operations is

\[ (2K-1) m + (L-1) (2m-1) m + (2m-1) d. \]

This is approximately \(2 L m^2 + 2 m d\) – which can be much smaller than \(2 L m^2 d\)! In other words, the reverse mode approach can be much faster. Note in particular that all computations in the reverse mode are matrix-vector products (or more precisely row vector-matrix products) rather than matrix-matrix products.
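To get a feel for the gap, one can plug illustrative dimensions into the two approximate counts above. The specific numbers below are arbitrary choices of ours, not from the text.

# forward ~ 2*L*m^2*d  versus  reverse ~ 2*L*m^2 + 2*m*d
L, m, d = 10, 100, 1000
forward_ops = 2 * L * m**2 * d
reverse_ops = 2 * L * m**2 + 2 * m * d
print(forward_ops, reverse_ops, forward_ops / reverse_ops)
# 200000000 400000 500.0 -- a factor of 500 in this (crude) count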

EXAMPLE: (continued) We apply the reverse mode approach to our previous example. We get

\[\begin{align*} G_{L+1} &:= J_{\ell}(\hat{\mathbf{y}}) = (\hat{\mathbf{y}} - \mathbf{y})^T\\ G_{L} &:= G_{L+1}\,J_{g_{L}}(\mathbf{z}_{L}) = G_{L+1} \mathcal{W}_{L} = (\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_{L} \\ \vdots\\ G_1 &:= G_2 \, J_{\bfg_1}(\mathbf{z}_1) = G_2 \mathcal{W}_{1} = [(\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_{L} \cdots \mathcal{W}_{2}] \mathcal{W}_{1} \\ \nabla f(\mathbf{x})^T &:= G_1 \, J_{\bfg_0}(\mathbf{z}_0) = [(\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_{L} \cdots \mathcal{W}_{2}\mathcal{W}_{1}] \mathcal{W}_{0}\\ &= (\mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x} - \mathbf{y})^T \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0}, \end{align*}\]

which matches our previous calculation. Note that all computations involve multiplying a row vector by a matrix. \(\lhd\)

NUMERICAL CORNER: We try our specific example.

with torch.no_grad():
    G2 = (z2 - y).unsqueeze(0)
    G1 = G2 @ W1
    grad_f = G1 @ W0
    
print(G2)
tensor([[-1.,  0.]])
print(G1)
tensor([[1., 0.]])
print(grad_f)
tensor([[ 0.,  1., -1.]])

We indeed obtain the same answer yet again.

\(\unlhd\)
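The reverse mode for the running linear example is an equally short loop, now sweeping through the matrices from left to right. Again, the helper below is a sketch of our own on the same numbers.

import torch

def reverse_mode_grad(x, y, Ws):
    z = x
    for W in Ws:              # forward pass: compute y_hat = z_{L+1}
        z = W @ z
    G = z - y                 # G_{L+1} = J_ell(y_hat) = (y_hat - y)^T
    for W in reversed(Ws):
        G = G @ W             # G_i = G_{i+1} W_i  (row vector times matrix)
    return G                  # = grad f(x)^T

Ws = [torch.tensor([[0., 1., -1.], [2., 0., 1.]]),
      torch.tensor([[-1., 0.], [2., -1.]])]
print(reverse_mode_grad(torch.tensor([1., 0., -1.]), torch.tensor([0., 1.]), Ws))
# tensor([ 0.,  1., -1.])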

To provide a little more insight in the savings obtained through the reverse mode, consider the following simple calculations. Let \(A, B \in \mathbb{R}^{n \times n}\) and \(\mathbf{v} \in \mathbb{R}^n\). Suppose we seek to compute \(\mathbf{v}^T B A\). By associativity of matrix multiplication, there are two ways of doing this: compute \(\mathbf{v}^{T}(BA)\) (i.e., first compute \(BA\) then multiply by \(\mathbf{v}^T\); or compute \((\mathbf{v}^T B) A\). The first approach requires \(n^2(2n-1) + n(2n-1)\) operations, while the second only requires \(2n(2n-1)\). The latter is much smaller since \(2 n^3\) (the leading term in the first approach) grows much faster than \(4 n^2\) (the leading term in the second approach) when \(n\) is large.

Why is this happening? One way to understand this is to think of the output \(\mathbf{v}^T B A\) as a linear combination of the rows of \(A\) – a very specific linear combination in fact. In the first approach, we compute \(BA\) which gives us \(n\) different linear combinations of the rows of \(A\) – none being the one we want – and then we compute the desired linear combination by multiplying by \(\mathbf{v}^T\). This is wasteful. In the second approach, we immediately compute the coefficients of the specific linear combination we seek – \(\mathbf{v}^T B\) – and then we compute that linear combination by multiplying to the right by \(A\).
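The savings can also be observed directly in code. The following timing sketch compares the two orders of multiplication; the size is arbitrary and the timings are hardware-dependent, so this is an illustration rather than a careful benchmark.

import time
import torch

n = 2000
A, B = torch.randn(n, n), torch.randn(n, n)
v = torch.randn(n)

t0 = time.perf_counter()
slow = v @ (B @ A)            # compute BA first: one matrix-matrix product
t1 = time.perf_counter()
fast = (v @ B) @ A            # two row vector-matrix products
t2 = time.perf_counter()

print(torch.allclose(slow, fast, rtol=1e-3))   # same result, up to rounding
print(f"v(BA): {t1 - t0:.4f}s, (vB)A: {t2 - t1:.4f}s")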

While the setting we examined in this subsection is illuminating, it is not exactly what we want. In the machine learning context, each “layer” \(\bfg_i\) has parameters (in our running example, these were the entries of \(\mathcal{W}_{i}\)) and we seek to optimize with respect to those parameters. For this, we need the gradient with respect to the parameters, not the input \(\mathbf{x}\). In the next subsection, we consider a generalization of the current setting, progressive functions, which will allow us to do this. The notation gets more complicated, but the basic idea remains the same.

8.3.2. Progressive functions

As mentioned previously, while it may seem natural to define a prediction function \(h\) (e.g., a classifier) as a function of the input data \(\mathbf{x}\in \mathbb{R}^{d}\), when fitting data we are ultimately interested in thinking of \(h\) as a function of the parameters \(\mathbf{w} \in \mathbb{R}^r\) that need to be adjusted – over a fixed dataset. Hence, in this section, the input \(\mathbf{x}\) is fixed while the vector of parameters \(\mathbf{w}\) is now variable.

A first example We use the example from the previous subsection to illustrate the main ideas. That is, suppose \(d=3\), \(L=1\), \(n_1 = 2\), and \(K = 2\). Fix a data sample \(\mathbf{x} = (x_1,x_2,x_3) \in \mathbb{R}^3, \mathbf{y} = (y_1, y_2) \in \mathbb{R}^2\). For \(i=0, 1\), we use the notation

\[\begin{split} \mathcal{W}_{0} = \begin{pmatrix} w_0 & w_1 & w_2\\ w_3 & w_4 & w_5 \end{pmatrix} \quad \text{and} \quad \mathcal{W}_{1} = \begin{pmatrix} w_6 & w_7\\ w_8 & w_9 \end{pmatrix}. \end{split}\]

and let

\[ \ell(\hat{\mathbf{y}}) = \frac{1}{2} \|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{2}(y_1 - \hat{y}_1)^2 + \frac{1}{2}(y_2 - \hat{y}_2)^2. \]

We change the notation for the “layer” function \(\bfg_i\) to reflect the fact that it is now a function of two (concatenated) vectors: the input \(\mathbf{z}_i = (z_{i,1},\ldots,z_{i,n_i})\) from the previous layer and a layer-specific set of parameters \(\mathbf{w}_i\). That is,

\[\begin{split} \bfg_i(\mathbf{z}_i, \mathbf{w}_i) = \mathcal{W}_{i} \mathbf{z}_i = \begin{pmatrix} (\mathbf{w}_i^{(1)})^T\\ (\mathbf{w}_i^{(2)})^T \end{pmatrix} \mathbf{z}_i \end{split}\]

with \(\mathbf{w}_i = (\mathbf{w}_i^{(1)}, \mathbf{w}_i^{(2)})\), the concatenation of the rows of \(\mathcal{W}_{i}\) (as column vectors). A different way to put this is that

\[ \mathbf{w}_i = \mathrm{vec}(\mathcal{W}_{i}^T), \]

where we took the transpose to turn the rows into columns. More specifically,

\[ \bfg_0(\mathbf{z}_0, \mathbf{w}_0) = \mathcal{W}_{0} \mathbf{z}_{0} \quad\text{with}\quad \mathbf{w}_0 = (w_0, w_1, w_2, w_3, w_4, w_5) \]

(i.e., \(\mathbf{w}_0^{(1)} = (w_0, w_1, w_2)\) and \(\mathbf{w}_0^{(2)} = (w_3, w_4, w_5)\)) and

\[ \bfg_1(\mathbf{z}_1, \mathbf{w}_1) = \mathcal{W}_{1} \mathbf{z}_{1} \quad\text{with}\quad \mathbf{w}_1 = (w_6, w_7, w_8, w_9) \]

(i.e., \(\mathbf{w}_1^{(1)} = (w_6, w_7)\) and \(\mathbf{w}_1^{(2)} = (w_8, w_9)\)).
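In code, this vectorization is simply a row-major flatten. The following small check (reusing, as an assumption, the numeric matrix from the numerical corners) illustrates the convention.

import torch

# PyTorch stores tensors in row-major order, so flattening a weight matrix
# concatenates its rows; this is exactly w_i = vec(W_i^T) as defined above.
W0 = torch.tensor([[0., 1., -1.], [2., 0., 1.]])
w0 = W0.flatten()                                    # (w_0, w_1, ..., w_5)
print(w0)                                            # tensor([ 0.,  1., -1.,  2.,  0.,  1.])
print(torch.equal(w0, torch.cat([W0[0], W0[1]])))    # True: rows, concatenated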

We seek to compute the gradient of

\[\begin{align*} f(\mathbf{w}) &= \ell(\bfg_1(\bfg_0(\mathbf{x},\mathbf{w}_0),\mathbf{w}_1))\\ &= \frac{1}{2} \|\mathbf{y} - \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\|^2\\ &= \frac{1}{2}\left(y_1 - w_6(w_0 x_1 + w_1 x_2 + w_2 x_3) - w_7(w_3 x_1 + w_4 x_2 + w_5 x_3)\right)^2\\ & \qquad + \frac{1}{2}\left(y_2 - w_8(w_0 x_1 + w_1 x_2 + w_2 x_3) - w_9(w_3 x_1 + w_4 x_2 + w_5 x_3)\right)^2. \end{align*}\]

by applying the Chain Rule backwards, as we justified in the previous subsection – but this time we take the gradient with respect to the parameters

\[ \mathbf{w} := (\mathbf{w}_0, \mathbf{w}_1) = (w_0,w_1,\ldots,w_9). \]

Notice a key change in the notation: we now accordingly think of \(f\) as a function of \(\mathbf{w}\); the role of \(\mathbf{x}\) is implicit.

On the other hand, it may seem counter-intuitive that we now think of \(\bfg_i\) as a function of both its own parameters and its inputs from the previous layer when we just stated that we only care about the gradient with respect to the former. But, as we will see, it turns out that we need the Jacobians with respect to both as the input from the previous layer actually depends on the parameters of the previous layers. For instance, \(\bfg_1(\mathbf{z}_1, \mathbf{w}_1) = \mathcal{W}_{1} \mathbf{z}_{1}\) where \(\mathbf{z}_{1} = \bfg_0(\mathbf{z}_0, \mathbf{w}_0) = \mathcal{W}_{0} \mathbf{z}_{0}\).

Recall that we have already computed the requisite Jacobians \(J_{\bfg_0}\) and \(J_{\bfg_1}\) in a previous example. We have also computed the Jacobian \(J_{\ell}\) of \(\ell\). At this point, it is tempting to apply the Chain Rule and deduce that the gradient of \(f\) is

\[ J_{\ell}(\bfg_1(\bfg_0(\mathbf{x},\mathbf{w}_0),\mathbf{w}_1)) \,J_{\bfg_1}(\bfg_0(\mathbf{x},\mathbf{w}_0),\mathbf{w}_1) \,J_{\bfg_0}(\mathbf{x}, \mathbf{w}_0). \]

But this is not correct. For one, the dimensions do not match! For instance, \(J_{\bfg_0} \in \mathbb{R}^{2 \times 9}\) since \(\bfg_0\) has \(2\) outputs and \(9\) inputs (i.e., \(z_{0,1}, z_{0,2}, z_{0,3}, w_0, w_1, w_2, w_3, w_4, w_5\)) while \(J_{\bfg_1} \in \mathbb{R}^{2 \times 6}\) since \(\bfg_1\) has \(2\) outputs and \(6\) inputs (i.e., \(z_{1,1}, z_{1,2}, w_6, w_7, w_8, w_9\)). So what went wrong?

The function \(f\) is not in fact a straight composition of the functions \(\ell\), \(\bfg_1\), and \(\bfg_0\). Indeed, the parameters to differentiate with respect to are introduced progressively, each layer injecting its own additional parameters which are not obtained from the previous layers. Hence we cannot write the gradient of \(f\) as a simple product of the Jacobians, unlike what happened in the previous subsection.

But not all is lost. We show below that we can still apply the Chain Rule step-by-step in a way that accounts for the additional parameters on each layer. Taking a hint from the previous subsection, we proceed forward first to compute \(f\) and the Jacobians, and then go backwards to compute the gradient \(\nabla f\). We use the notation \(\mathbb{A}_{n}[\mathbf{x}]\) and \(\mathbb{B}_{n}[\mathbf{z}]\) from the background section.

In the forward phase, we compute \(f\) itself and the requisite Jacobians:

\[\begin{align*} &\mathbf{z}_0 := \mathbf{x}\\ & = (x_1, x_2, x_3)\\ &\mathbf{z}_1 := \bfg_0(\mathbf{z}_0, \mathbf{w}_0) = \mathcal{W}_{0} \mathbf{z}_{0}\\ &= \begin{pmatrix} (\mathbf{w}_0^{(1)})^T\mathbf{x}\\ (\mathbf{w}_0^{(2)})^T\mathbf{x}\end{pmatrix} = \begin{pmatrix} w_0 x_1 + w_1 x_2 + w_2 x_3\\ w_3 x_1 + w_4 x_2 + w_5 x_3 \end{pmatrix}\\ &J_{\bfg_0}(\mathbf{z}_0, \mathbf{w}_0) := \begin{pmatrix} \mathbb{A}_{2}[\mathbf{w}_0] & \mathbb{B}_{2}[\mathbf{z}_0] \end{pmatrix} = \begin{pmatrix} \mathcal{W}_{0} & I_{2\times 2} \otimes \mathbf{z}_0^T \end{pmatrix}\\ &= \begin{pmatrix} w_0 & w_1 & w_2 & x_1 & x_2 & x_3 & 0 & 0 & 0\\ w_3 & w_4 & w_5 & 0 & 0 & 0 & x_1 & x_2 & x_3 \end{pmatrix} \end{align*}\]
\[\begin{align*} &\hat{\mathbf{y}} := \mathbf{z}_2 := \bfg_1(\mathbf{z}_1, \mathbf{w}_1) = \mathcal{W}_{1} \mathbf{z}_{1}\\ &= \begin{pmatrix} w_6 z_{1,1} + w_7 z_{1,2}\\ w_8 z_{1,1} + w_9 z_{1,2} \end{pmatrix}\\ &= \begin{pmatrix} w_6 (\mathbf{w}_0^{(1)})^T\mathbf{x} + w_7 (\mathbf{w}_0^{(2)})^T\mathbf{x}\\ w_8 (\mathbf{w}_0^{(1)})^T\mathbf{x} + w_9 (\mathbf{w}_0^{(2)})^T\mathbf{x} \end{pmatrix}\\ &= \begin{pmatrix} w_6(w_0 x_1 + w_1 x_2 + w_2 x_3) + w_7(w_3 x_1 + w_4 x_2 + w_5 x_3)\\ w_8(w_0 x_1 + w_1 x_2 + w_2 x_3) + w_9(w_3 x_1 + w_4 x_2 + w_5 x_3) \end{pmatrix}\\ &J_{\bfg_1}(\mathbf{z}_1, \mathbf{w}_1):= \begin{pmatrix} \mathbb{A}_{2}[\mathbf{w}_1] & \mathbb{B}_{2}[\mathbf{z}_1] \end{pmatrix} = \begin{pmatrix} \mathcal{W}_{1} & I_{2\times 2} \otimes \mathbf{z}_1^T \end{pmatrix}\\ &= \begin{pmatrix} w_6 & w_7 & z_{1,1} & z_{1,2} & 0 & 0\\ w_8 & w_9 & 0 & 0 & z_{1,1} & z_{1,2} \end{pmatrix}\\ &= \begin{pmatrix} w_6 & w_7 & (\mathbf{w}_0^{(1)})^T\mathbf{x} & (\mathbf{w}_0^{(2)})^T\mathbf{x} & 0 & 0\\ w_8 & w_9 & 0 & 0 & (\mathbf{w}_0^{(1)})^T\mathbf{x} & (\mathbf{w}_0^{(2)})^T\mathbf{x} \end{pmatrix} \end{align*}\]
\[\begin{align*} &f(\mathbf{w}) := \ell(\hat{\mathbf{y}}) = \frac{1}{2} \|\mathbf{y} - \hat{\mathbf{y}}\|^2\\ &= \frac{1}{2}\left(y_1 - w_6(w_0 x_1 + w_1 x_2 + w_2 x_3) - w_7(w_3 x_1 + w_4 x_2 + w_5 x_3)\right)^2\\ & \qquad + \frac{1}{2}\left(y_2 - w_8(w_0 x_1 + w_1 x_2 + w_2 x_3) - w_9(w_3 x_1 + w_4 x_2 + w_5 x_3)\right)^2\\ &J_{\ell}(\hat{\mathbf{y}}) = (\hat{\mathbf{y}} - \mathbf{y})^T\\ &= \begin{pmatrix} w_6 (\mathbf{w}_0^{(1)})^T\mathbf{x} + w_7 (\mathbf{w}_0^{(2)})^T\mathbf{x} - y_1 & w_8 (\mathbf{w}_0^{(1)})^T\mathbf{x} + w_9 (\mathbf{w}_0^{(2)})^T\mathbf{x} - y_2 \end{pmatrix}. \end{align*}\]

We now compute the gradient of \(f\) with respect to \(\mathbf{w}\). We start with \(\mathbf{w}_1 = (w_6, w_7, w_8, w_9)\). For this step, we think of \(f\) as the composition \(\ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))\). Here \(\mathbf{z}_1\) does not depend on \(\mathbf{w}_1\) and therefore can be considered fixed for this calculation. By the Chain Rule

\[\begin{align*} \frac{\partial f(\mathbf{w})}{\partial w_6} &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial w_6} = \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_1} \frac{\partial g_{1,1}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_6} + \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_2} \frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_6} = (\hat{y}_1 - y_1) z_{1,1} \end{align*}\]

where we used the fact that \(g_{1,2}(\mathbf{z}_1, \mathbf{w}_1) = w_8 z_{1,1} + w_9 z_{1,2}\) does not depend on \(w_6\) and therefore \(\frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_6} = 0\). Similarly

\[\begin{align*} \frac{\partial f(\mathbf{w})}{\partial w_7} &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial w_7} =\frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_1} \frac{\partial g_{1,1}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_7} + \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_2} \frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_7} = (\hat{y}_1 - y_1) z_{1,2}\\ \frac{\partial f(\mathbf{w})}{\partial w_8} &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial w_8} =\frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_1} \frac{\partial g_{1,1}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_8} + \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_2} \frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_8} = (\hat{y}_2 - y_2) z_{1,1}\\ \frac{\partial f(\mathbf{w})}{\partial w_9} &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial w_9} =\frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_1} \frac{\partial g_{1,1}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_9} + \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_2} \frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial w_9} = (\hat{y}_2 - y_2) z_{1,2}. \end{align*}\]

In matrix form, this is

\[\begin{align*} &\begin{pmatrix}\frac{\partial f(\mathbf{w})}{\partial w_6} & \frac{\partial f(\mathbf{w})}{\partial w_7} & \frac{\partial f(\mathbf{w})}{\partial w_8} & \frac{\partial f(\mathbf{w})}{\partial w_9} \end{pmatrix}\\ &= J_{\ell}(\hat{\mathbf{y}}) \,\mathbb{B}_{2}[\mathbf{z}_1]\\ &= (\hat{\mathbf{y}} - \mathbf{y})^T (I_{2\times 2} \otimes \mathbf{z}_1^T)\\ &= (\hat{\mathbf{y}} - \mathbf{y})^T \otimes \mathbf{z}_1^T\\ &= \begin{pmatrix} (\hat{y}_1 - y_1) z_{1,1} & (\hat{y}_1 - y_1) z_{1,2} & (\hat{y}_2 - y_2) z_{1,1} & (\hat{y}_2 - y_2) z_{1,2} \end{pmatrix} \end{align*}\]

where we used Properties of the Kronecker Product (f) on the second to last line.

To compute the partial derivatives with respect to \(\mathbf{w}_0 = (w_0, w_1, \ldots, w_5)\), we first need to compute partial derivatives with respect to \(\mathbf{z}_1 = (z_{1,1}, z_{1,2})\) since \(f\) depends on \(\mathbf{w}_0\) through it. For this calculation, we think again of \(f\) as the composition \(\ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))\), but this time our focus is on the variables \(\mathbf{z}_1\). We obtain

\[\begin{align*} \frac{\partial f(\mathbf{w})}{\partial z_{1,1}} &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial z_{1,1}}\\ &= \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_1} \frac{\partial g_{1,1}(\mathbf{z}_1, \mathbf{w}_1)}{\partial z_{1,1}} + \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_2} \frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial z_{1,1}}\\ &= (\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8 \end{align*}\]

and

\[\begin{align*} \frac{\partial f(\mathbf{w})}{\partial z_{1,2}} &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial z_{1,2}}\\ &= \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_1} \frac{\partial g_{1,1}(\mathbf{z}_1, \mathbf{w}_1)}{\partial z_{1,2}} + \frac{\partial \ell(\hat{\mathbf{y}})}{\partial \hat{y}_2} \frac{\partial g_{1,2}(\mathbf{z}_1, \mathbf{w}_1)}{\partial z_{1,2}}\\ &= (\hat{y}_1 - y_1) w_7 + (\hat{y}_2 - y_2) w_9. \end{align*}\]

In matrix form, this is

\[\begin{align*} &\begin{pmatrix}\frac{\partial f(\mathbf{w})}{\partial z_{1,1}} & \frac{\partial f(\mathbf{w})}{\partial z_{1,2}} \end{pmatrix}\\ &= J_{\ell}(\hat{\mathbf{y}}) \,\mathbb{A}_{2}[\mathbf{z}_1]\\ &= (\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_1\\ &= \begin{pmatrix} (\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8 & (\hat{y}_1 - y_1) w_7 + (\hat{y}_2 - y_2) w_9 \end{pmatrix}. \end{align*}\]

The vector \(\left(\frac{\partial f(\mathbf{w})}{\partial z_{1,1}}, \frac{\partial f(\mathbf{w})}{\partial z_{1,2}}\right)\) is called an adjoint.

We now compute the gradient of \(f\) with respect to \(\mathbf{w}_0 = (w_0, w_1, \ldots, w_5)\). For this step, we think of \(f\) as the composition of \(\ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))\) as a function of \(\mathbf{z}_1\) and \(\bfg_0(\mathbf{z}_0, \mathbf{w}_0)\) as a function of \(\mathbf{w}_0\). Here \(\mathbf{z}_0\) does not depend on \(\mathbf{w}_0\) and therefore can be considered fixed for this calculation. By the Chain Rule

\[\begin{align*} \frac{\partial f(\mathbf{w})}{\partial w_0} &= \frac{\partial \ell(\bfg_1(\bfg_0(\mathbf{z}_0, \mathbf{w}_0), \mathbf{w}_1))}{\partial w_0}\\ &= \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial z_{1,1}} \frac{\partial g_{0,1}(\mathbf{z}_0, \mathbf{w}_0)}{\partial w_0} + \frac{\partial \ell(\bfg_1(\mathbf{z}_1, \mathbf{w}_1))}{\partial z_{1,2}} \frac{\partial g_{0,2}(\mathbf{z}_0, \mathbf{w}_0)}{\partial w_0}\\ &= ((\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8) z_{0,1}\\ &= ((\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8) x_{1}, \end{align*}\]

where we used the fact that \(g_{0,2}(\mathbf{z}_0, \mathbf{w}_0) = w_3 z_{0,1} + w_4 z_{0,2} + w_5 z_{0,3}\) does not depend on \(w_0\) and therefore \(\frac{\partial g_{0,2}(\mathbf{z}_0, \mathbf{w}_0)}{\partial w_0} = 0\).

Similarly (check it!)

\[\begin{align*} \frac{\partial f(\mathbf{w})}{\partial w_1} &= ((\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8) x_{2}\\ \frac{\partial f(\mathbf{w})}{\partial w_2} &= ((\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8) x_{3}\\ \frac{\partial f(\mathbf{w})}{\partial w_3} &= ((\hat{y}_1 - y_1) w_7 + (\hat{y}_2 - y_2) w_9) x_{1}\\ \frac{\partial f(\mathbf{w})}{\partial w_4} &= ((\hat{y}_1 - y_1) w_7 + (\hat{y}_2 - y_2) w_9) x_{2}\\ \frac{\partial f(\mathbf{w})}{\partial w_5} &= ((\hat{y}_1 - y_1) w_7 + (\hat{y}_2 - y_2) w_9) x_{3}. \end{align*}\]

In matrix form, this is

\[\begin{align*} &\begin{pmatrix}\frac{\partial f(\mathbf{w})}{\partial w_0} & \frac{\partial f(\mathbf{w})}{\partial w_1} & \frac{\partial f(\mathbf{w})}{\partial w_2} & \frac{\partial f(\mathbf{w})}{\partial w_3} & \frac{\partial f(\mathbf{w})}{\partial w_4} & \frac{\partial f(\mathbf{w})}{\partial w_5} \end{pmatrix}\\ &= J_{\ell}(\hat{\mathbf{y}}) \,\mathbb{A}_{2}[\mathbf{z}_1] \,\mathbb{B}_{2}[\mathbf{z}_0]\\ &= (\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_1 (I_{2\times 2} \otimes \mathbf{z}_0^T)\\ &= ((\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_1) \otimes \mathbf{x}^T\\ &= \begin{pmatrix} ((\hat{y}_1 - y_1) w_6 + (\hat{y}_2 - y_2) w_8) x_{1} & \cdots & ((\hat{y}_1 - y_1) w_7 + (\hat{y}_2 - y_2) w_9) x_{3} \end{pmatrix} \end{align*}\]

where we used Properties of the Kronecker Product (f) on the second to last line.

To sum up,

\[ \nabla f (\mathbf{w})^T = \begin{pmatrix} ((\hat{\mathbf{y}} - \mathbf{y})^T \mathcal{W}_1) \otimes \mathbf{x}^T & (\hat{\mathbf{y}} - \mathbf{y})^T \otimes (\mathcal{W}_{0} \mathbf{x})^T \end{pmatrix}. \]

NUMERICAL CORNER: We return to the concrete example from the previous subsection. This time it is the matrices W0 and W1 that require gradients.

x = torch.tensor([1.,0.,-1.])
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]], requires_grad=True)
W1 = torch.tensor([[-1.,0.],[2.,-1.]], requires_grad=True)

z0 = x
z1 = W0 @ z0
z2 = W1 @ z1
f = 0.5 * (torch.linalg.vector_norm(y-z2) ** 2)

print(z0)
tensor([ 1.,  0., -1.])
print(z1)
tensor([1., 1.], grad_fn=<MvBackward0>)
print(z2)
tensor([-1.,  1.], grad_fn=<MvBackward0>)
print(f)
tensor(0.5000, grad_fn=<MulBackward0>)

We compute the gradient \(\nabla f(\mathbf{w})\) using AD.

f.backward()
print(W0.grad)
tensor([[ 1.,  0., -1.],
        [ 0.,  0., -0.]])
print(W1.grad)
tensor([[-1., -1.],
        [-0., -0.]])

These are written in the form of matrix derivatives

\[\begin{split} \frac{\partial f}{\partial \mathcal{W}_0} = \begin{pmatrix} \frac{\partial f}{\partial w_0} & \frac{\partial f}{\partial w_1} & \frac{\partial f}{\partial w_2} \\ \frac{\partial f}{\partial w_3} & \frac{\partial f}{\partial w_4} & \frac{\partial f}{\partial w_5} \end{pmatrix} \quad\text{and}\quad \frac{\partial f}{\partial \mathcal{W}_1} = \begin{pmatrix} \frac{\partial f}{\partial w_6} & \frac{\partial f}{\partial w_7} \\ \frac{\partial f}{\partial w_8} & \frac{\partial f}{\partial w_9} \end{pmatrix}. \end{split}\]

We use our formulas to confirm that they match these results. We need the Kronecker product, which in PyTorch is implemented as torch.kron.

with torch.no_grad():
    grad_W0 = torch.kron((z2 - y).unsqueeze(0) @ W1, z0.unsqueeze(0))
    grad_W1 = torch.kron((z2 - y).unsqueeze(0), z1.unsqueeze(0))

print(grad_W0)
tensor([[ 1.,  0., -1.,  0.,  0., -0.]])
print(grad_W1)
tensor([[-1., -1.,  0.,  0.]])

Observe that this time these results are written in vectorized form (i.e., obtained by concatenating the rows). But they do match with the AD output.
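As an additional sanity check (assuming the tensors above are still in scope), flattening the AD gradients row by row should reproduce the vectorized formulas exactly:

print(torch.allclose(grad_W0, W0.grad.reshape(1, -1)))   # True
print(torch.allclose(grad_W1, W1.grad.reshape(1, -1)))   # True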

\(\unlhd\)

General setting \(\idx{progressive function}\xdi\) More generally, we have \(L+2\) layers. The input layer is \(\mathbf{z}_0 := \mathbf{x}\), which we refer to as layer \(0\). Hidden layer \(i\), \(i=1,\ldots,L\), is defined by a continuously differentiable function \(\mathbf{z}_i := \bfg_{i-1}(\mathbf{z}_{i-1}, \mathbf{w}_{i-1})\) which this time takes two vector-valued inputs: a vector \(\mathbf{z}_{i-1} \in \mathbb{R}^{n_{i-1}}\) fed from the \((i-1)\)-st layer and a vector \(\mathbf{w}_{i-1} \in \mathbb{R}^{r_{i-1}}\) of parameters specific to the \(i\)-th layer

\[ \bfg_{i-1} = (g_{i-1,1},\ldots,g_{i-1,n_{i}}) : \mathbb{R}^{n_{i-1} + r_{i-1}} \to \mathbb{R}^{n_{i}}. \]

The output \(\mathbf{z}_i\) of \(\bfg_{i-1}\) is a vector in \(\mathbb{R}^{n_{i}}\) which is passed to the \((i+1)\)-st layer as input. The output layer is \(\mathbf{z}_{L+1} := \bfg_{L}(\mathbf{z}_{L}, \mathbf{w}_{L})\), which we also refer to as layer \(L+1\).

For \(i = 1,\ldots,L+1\), let

\[ \overline{\mathbf{w}}^{i-1} = (\mathbf{w}_0,\mathbf{w}_1,\ldots,\mathbf{w}_{i-1}) \in \mathbb{R}^{r_0 + r_1+\cdots+r_{i-1}} \]

be the concatenation of the parameters from the first \(i\) layers (not including the input layer, which does not have parameters) as a vector in \(\mathbb{R}^{r_0+r_1+\cdots+r_{i-1}}\). Then the output of layer \(i\) as a function of the parameters is the composition

\[ \mathcal{O}_{i-1}(\overline{\mathbf{w}}^{i-1}) = \bfg_{i-1}(\mathcal{O}_{i-2}(\overline{\mathbf{w}}^{i-2}), \mathbf{w}_{i-1}) = \bfg_{i-1}(\bfg_{i-2}(\cdots \bfg_1(\bfg_0(\mathbf{x},\mathbf{w}_0),\mathbf{w}_1), \cdots, \mathbf{w}_{i-2}), \mathbf{w}_{i-1}) \in \mathbb{R}^{n_{i}}, \]

for \(i = 2, \ldots, L+1\). When \(i=1\), we have simply

\[ \mathcal{O}_{0}(\overline{\mathbf{w}}^{0}) = \bfg_{0}(\mathbf{x}, \mathbf{w}_0). \]

Observe that the function \(\mathcal{O}_{i-1}\) depends implicitly on the input \(\mathbf{x}\) – which we do not think of as a variable in this setting. To simplify the notation, we do not make the dependence on \(\mathbf{x}\) explicit.

Letting \(\mathbf{w} := \overline{\mathbf{w}}^{L}\), the final output is

\[ \bfh(\mathbf{w}) = \mathcal{O}_{L}(\overline{\mathbf{w}}^{L}). \]

Expanding out the composition, this can be written alternatively as

\[ \bfh(\mathbf{w}) = \bfg_{L}(\bfg_{L-1}(\cdots \bfg_1(\bfg_0(\mathbf{x},\mathbf{w}_0),\mathbf{w}_1), \cdots, \mathbf{w}_{L-1}), \mathbf{w}_{L}). \]

Again, we do not make the dependence on \(\mathbf{x}\) explicit.

As a final step, we have a loss function \(\ell : \mathbb{R}^{n_{L+1}} \to \mathbb{R}\) which takes as input the output of the last layer and measures the fit to the given label \(\mathbf{y} \in \Delta_K\). We will see some examples below. The final function is then

\[ f(\mathbf{w}) = \ell(\bfh(\mathbf{w})) \in \mathbb{R}. \]

We seek to compute the gradient of \(f(\mathbf{w})\) with respect to the parameters \(\mathbf{w}\) in order to apply a gradient descent method.

EXAMPLE: (continued) We return to the running example from the previous subsection. That is, \(\bfg_i(\mathbf{z}_i, \mathbf{w}_i) = \mathcal{W}_{i} \mathbf{z}_i\) where the entries of \(\mathcal{W}_{i} \in \mathbb{R}^{n_{i+1} \times n_i}\) are considered parameters and we let \(\mathbf{w}_i = \mathrm{vec}(\mathcal{W}_{i}^T)\). Assume also that \(\ell : \mathbb{R}^K \to \mathbb{R}\) is defined as

\[ \ell(\hat{\mathbf{y}}) = \frac{1}{2} \|\mathbf{y} - \hat{\mathbf{y}}\|^2, \]

for a fixed, known vector \(\mathbf{y} \in \mathbb{R}^{K}\).

Computing \(f\) recursively gives

\[\begin{align*} \mathbf{z}_0 &:= \mathbf{x}\\ \mathbf{z}_1 &:= \mathcal{O}_0(\overline{\mathbf{w}}^0) = \bfg_0(\mathbf{z}_0, \mathbf{w}_0) = \mathcal{W}_{0} \mathbf{z}_0 = \mathcal{W}_{0} \mathbf{x}\\ \mathbf{z}_2 &:= \mathcal{O}_1(\overline{\mathbf{w}}^1) = \bfg_1(\mathbf{z}_1, \mathbf{w}_1) = \mathcal{W}_{1} \mathbf{z}_1 = \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\\ \vdots\\ \mathbf{z}_L &:= \mathcal{O}_{L-1}(\overline{\mathbf{w}}^{L-1}) = \bfg_{L-1}(\mathbf{z}_{L-1}, \mathbf{w}_{L-1}) = \mathcal{W}_{L-1} \mathbf{z}_{L-1} = \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\\ \hat{\mathbf{y}} := \mathbf{z}_{L+1} &:= \mathcal{O}_{L}(\overline{\mathbf{w}}^{L}) = \bfg_{L}(\mathbf{z}_{L}, \mathbf{w}_{L}) = \mathcal{W}_{L} \mathbf{z}_{L} = \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\\ f(\mathbf{x}) &:= \ell(\hat{\mathbf{y}}) = \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 = \frac{1}{2}\left\|\mathbf{y} - \mathcal{W}_{L} \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x}\right\|^2. \end{align*}\]

\(\lhd\)

Applying the chain rule Recall that the key insight from the Chain Rule is that to compute the gradient of a composition such as \(\bfh(\mathbf{w})\) – no matter how complex – it suffices to separately compute the Jacobians of the intervening functions and then take matrix products. In this section, we compute the necessary Jacobians in the progressive case.

It will be convenient to re-write the basic composition step as

\[ \mathcal{O}_{i}(\overline{\mathbf{w}}^{i}) = \bfg_{i}(\mathcal{O}_{i-1}(\overline{\mathbf{w}}^{i-1}), \mathbf{w}_{i}) = \bfg_{i}(\mathcal{I}_{i}(\overline{\mathbf{w}}^{i})) \in \mathbb{R}^{n_{i+1}}, \]

where the input to layer \(i+1\) (both layer-specific parameters and the output of the previous layer) is

\[ \mathcal{I}_{i}(\overline{\mathbf{w}}^{i}) = \left( \mathcal{O}_{i-1}(\overline{\mathbf{w}}^{i-1}), \mathbf{w}_{i} \right) \in \mathbb{R}^{n_{i} + r_{i}}, \]

for \(i = 1, \ldots, L\). When \(i=0\), we have simply

\[ \mathcal{I}_{0}(\overline{\mathbf{w}}^{0}) = \left(\mathbf{z}_0, \mathbf{w}_0 \right) = \left(\mathbf{x}, \mathbf{w}_0 \right). \]

Applying the Chain Rule we get

\[ J_{\mathcal{O}_{i}}(\overline{\mathbf{w}}^{i}) = J_{\bfg_i}(\mathcal{I}_{i}(\overline{\mathbf{w}}^{i})) \,J_{\mathcal{I}_{i}}(\overline{\mathbf{w}}^{i}). \]

First, the Jacobian of

\[ \mathcal{I}_{i}(\overline{\mathbf{w}}^{i}) = \left( \mathcal{O}_{i-1}(\overline{\mathbf{w}}^{i-1}), \mathbf{w}_i \right) \]

has a simple block diagonal structure

\[\begin{split} J_{\mathcal{I}_{i}}(\overline{\mathbf{w}}^{i}) = \begin{pmatrix} J_{\mathcal{O}_{i-1}}(\overline{\mathbf{w}}^{i-1}) & 0 \\ 0 & I_{r_i \times r_i} \end{pmatrix} \in \mathbb{R}^{(n_{i} + r_{i})\times(r_0 + \cdots + r_i)} \end{split}\]

since the first block component of \(\mathcal{I}_{i}\), \(\mathcal{O}_{i-1}(\overline{\mathbf{w}}^{i-1})\), does not depend on \(\mathbf{w}_i\) whereas the second block component of \(\mathcal{I}_{i}\), \(\mathbf{w}_i\), does not depend on \(\overline{\mathbf{w}}^{i-1}\). Observe that this is a fairly large matrix whose number of columns in particular grows with \(i\). That last formula is for \(i \geq 1\). When \(i=0\) we have \(\mathcal{I}_{0}(\overline{\mathbf{w}}^{0}) = \left(\mathbf{x}, \mathbf{w}_0\right)\), so that

\[\begin{split} J_{\mathcal{I}_{0}}(\overline{\mathbf{w}}^{0}) = \begin{pmatrix} \mathbf{0}_{d \times r_0} \\ I_{r_0 \times r_0} \end{pmatrix} \in \mathbb{R}^{(d + r_0) \times r_0}. \end{split}\]

We partition the Jacobian of \(\bfg_i(\mathbf{z}_i, \mathbf{w}_i)\) likewise, that is, we divide it into those columns corresponding to partial derivatives with respect to \(\mathbf{z}_{i}\) (the corresponding block being denoted by \(A_i\)) and with respect to \(\mathbf{w}_i\) (the corresponding block being denoted by \(B_i\))

\[ J_{\bfg_i}(\mathbf{z}_i, \mathbf{w}_i) = \begin{pmatrix} A_i & B_i \end{pmatrix} \in \mathbb{R}^{n_{i+1} \times (n_i + r_i)}, \]

evaluated at \((\mathbf{z}_i, \mathbf{w}_i) = \mathcal{I}_{i}(\overline{\mathbf{w}}^{i}) = (\mathcal{O}_{i-1}(\overline{\mathbf{w}}^{i-1}), \mathbf{w}_i)\). Note that \(A_i\) and \(B_i\) depend on the details of the function \(\bfg_i\), which typically is fairly simple. We give examples in the next subsections.

Plugging back above we get

\[\begin{split} J_{\mathcal{O}_{i}}(\overline{\mathbf{w}}^{i}) = \begin{pmatrix} A_i & B_i \end{pmatrix} \,\begin{pmatrix} J_{\mathcal{O}_{i-1}}(\overline{\mathbf{w}}^{i-1}) & 0 \\ 0 & I_{r_i \times r_i} \end{pmatrix}. \end{split}\]

This leads to the recursion

\[ J_{\mathcal{O}_{i}}(\overline{\mathbf{w}}^{i}) = \begin{pmatrix} A_i \, J_{\mathcal{O}_{i-1}}(\overline{\mathbf{w}}^{i-1}) & B_i \end{pmatrix} \in \mathbb{R}^{n_{i+1}\times(r_0 + \cdots + r_i)} \]

from which the Jacobian of \(\mathbf{h}(\mathbf{w})\) can be computed. Like \(J_{\mathcal{I}_{i}}\), \(J_{\mathcal{O}_{i}}\) is a large matrix. We refer to this matrix equation as the fundamental recursion.

The base case \(i=0\) is

\[\begin{split} J_{\mathcal{O}_{0}}(\overline{\mathbf{w}}^{0}) = \begin{pmatrix} A_0 & B_0 \end{pmatrix}\begin{pmatrix} \mathbf{0}_{d \times r_0} \\ I_{r_0 \times r_0} \end{pmatrix} = B_0. \end{split}\]

Finally, using the Chain Rule again

\[\begin{align*} \nabla {f(\mathbf{w})} &= J_{f}(\mathbf{w})^T\\ &= [J_{\ell}(\bfh(\mathbf{w})) \,J_{\bfh}(\mathbf{w})]^T\\ &= J_{\bfh}(\mathbf{w})^T \,\nabla {\ell}(\bfh(\mathbf{w}))\\ &= J_{\mathcal{O}_{L}}(\overline{\mathbf{w}}^{L})^T \,\nabla {\ell}(\mathcal{O}_{L}(\overline{\mathbf{w}}^{L})). \end{align*}\]

The matrix \(J_{\mathcal{O}_{L}}(\overline{\mathbf{w}}^{L})\) is computed using the recursion above, while \(\nabla {\ell}\) depends on the function \(\ell\).

Backpropagation \(\idx{backpropagation}\xdi\) We take advantage of the fundamental recursion to compute the gradient of \(f\). As we have seen, there are two ways of doing this. Applying the recursion directly is one of them, but it requires many matrix-matrix products. The first few steps are

\[ J_{\mathcal{O}_{0}}(\overline{\mathbf{w}}^{0}) = B_0, \]
\[ J_{\mathcal{O}_{1}}(\overline{\mathbf{w}}^{1}) = \begin{pmatrix} A_1 J_{\mathcal{O}_{0}}(\overline{\mathbf{w}}^{0}) & B_1 \end{pmatrix} \]
\[ J_{\mathcal{O}_{2}}(\overline{\mathbf{w}}^{2}) = \begin{pmatrix} A_2 \, J_{\mathcal{O}_{1}}(\overline{\mathbf{w}}^{1}) & B_2 \end{pmatrix}, \]

and so on.

Instead, as in the case of differentiating with respect to the input \(\mathbf{x}\), one can also run the recursion backwards. The latter approach can be much faster because, as we detail next, it involves only matrix-vector products. Start from the end, that is, with the equation

\[ \nabla {f}(\mathbf{w}) = J_{\bfh}(\mathbf{w})^T \,\nabla {\ell}(\bfh(\mathbf{w})). \]

Note that \(\nabla {\ell}(\bfh(\mathbf{w}))\) is a vector – not a matrix. Then expand the matrix \(J_{\bfh}(\mathbf{w})\) using the recursion above

\[\begin{align*} \nabla {f}(\mathbf{w}) &= J_{\bfh}(\mathbf{w})^T \,\nabla {\ell}(\bfh(\mathbf{w}))\\ &= J_{\mathcal{O}_{L}}(\overline{\mathbf{w}}^{L})^T \,\nabla {\ell}(\bfh(\mathbf{w}))\\ &= \begin{pmatrix} A_L \, J_{\mathcal{O}_{L-1}}(\overline{\mathbf{w}}^{L-1}) & B_L \end{pmatrix}^T \,\nabla {\ell}(\bfh(\mathbf{w}))\\ &= \begin{pmatrix} J_{\mathcal{O}_{L-1}}(\overline{\mathbf{w}}^{L-1})^T A_L^T \\ B_L^T \end{pmatrix} \,\nabla {\ell}(\bfh(\mathbf{w}))\\ &= \begin{pmatrix} J_{\mathcal{O}_{L-1}}(\overline{\mathbf{w}}^{L-1})^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\} \\ B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) \end{pmatrix}. \end{align*}\]

The key is that both expressions \(A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\) and \(B_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\) are matrix-vector products. That pattern persists at the next level of recursion. Note that this supposes that we have precomputed \(\bfh(\mathbf{w})\) first.

At the next level, we expand the matrix \(J_{\mathcal{O}_{L-1}}(\overline{\mathbf{w}}^{L-1})^T\) using the fundamental recursion

\[\begin{align*} \nabla {f}(\mathbf{w}) &= \begin{pmatrix} J_{\mathcal{O}_{L-1}}(\overline{\mathbf{w}}^{L-1})^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\} \\ B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) \end{pmatrix}\\ &= \begin{pmatrix} \begin{pmatrix} A_{L-1} \, J_{\mathcal{O}_{L-2}}(\overline{\mathbf{w}}^{L-2}) & B_{L-1} \end{pmatrix}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\} \\ B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) \end{pmatrix}\\ &= \begin{pmatrix} \begin{pmatrix} J_{\mathcal{O}_{L-2}}(\overline{\mathbf{w}}^{L-2})\,A_{L-1}^T \\ B_{L-1}^T \end{pmatrix} \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\} \\ B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) \end{pmatrix}\\ &= \begin{pmatrix} J_{\mathcal{O}_{L-2}}(\overline{\mathbf{w}}^{L-2})\left\{A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\right\} \\ B_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\\ B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) \end{pmatrix}. \end{align*}\]

Continuing by induction gives an alternative formula for the gradient of \(f\).

Indeed, the next level gives

\[\begin{align*} \nabla {f}(\mathbf{w}) &= \begin{pmatrix} J_{\mathcal{O}_{L-3}}(\overline{\mathbf{w}}^{L-3})\left\{A_{L-2}^T \left\{A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\right\}\right\} \\ B_{L-2}^T \left\{A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\right\} \\ B_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\\ B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) \end{pmatrix}. \end{align*}\]

and so on. Observe that we do not in fact need to compute the large matrices \(J_{\mathcal{O}_{i}}\) – only the sequence of vectors \(B_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\), \(B_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\), \(B_{L-2}^T \left\{A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\right\}\), etc.

These formulas may seem cumbersome, but they take an intuitive form. Matrix \(A_i\) is the submatrix of the Jacobian \(J_{\bfg_i}\) corresponding only to the partial derivatives with respect to \(\mathbf{z}_i\), i.e., the input from the previous layer. Matrix \(B_i\) is the submatrix of the Jacobian \(J_{\bfg_i}\) corresponding only to the partial derivatives with respect to \(\mathbf{w}_i\), i.e., the layer-specific parameters. To compute the subvector of \(\nabla f\) corresponding to the parameters \(\mathbf{w}_i\) of the \((i+1)\)-th layer, we repeatedly differentiate with respect to the inputs of the previous layer (by multiplying by the corresponding \(A_j^T\)) starting from the last one, until we reach layer \(i+1\) at which point we take partial derivatives with respect to the layer-specific parameters (by multiplying by \(B_i^T\)). The process stops there since the layers preceding it do not depend on \(\mathbf{w}_i\) and therefore its full effect on \(f\) has been accounted for.

In other words, we need to compute

\[ \mathbf{p}_{L} := A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\]

and

\[ \mathbf{q}_{L} := B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})), \]

then

\[ \mathbf{p}_{L-1} := A_{L-1}^T \mathbf{p}_{L} = A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\} \]

and

\[ \mathbf{q}_{L-1} := B_{L-1}^T \mathbf{p}_{L} = B_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}, \]

then

\[ \mathbf{p}_{L-2} := A_{L-2}^T \mathbf{p}_{L-1} = A_{L-2}^T \left\{A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\}\right\} \]

and

\[ \mathbf{q}_{L-2} := B_{L-2}^T \mathbf{p}_{L-1} = B_{L-2}^T \left\{A_{L-1}^T \left\{ A_L^T \,\nabla {\ell}(\bfh(\mathbf{w}))\right\} \right\}, \]

and so on. The \(\mathbf{p}_i\)s are referred to as adjoints; they correspond to the vectors of partial derivatives of \(f\) with respect to the \(\mathbf{z}_i\)s.

There is one more detail to note. The matrices \(A_i, B_i\) depend on \(\mathbf{z}_i\), the input fed from the previous layer. To compute them, we first proceed forward, that is, we let \(\mathbf{z}_0 = \mathbf{x}\) then

\[ \mathbf{z}_1 = \mathcal{O}_{0}(\overline{\mathbf{w}}^{0}) = \bfg_0(\mathbf{z}_0, \mathbf{w}_0), \]
\[ \mathbf{z}_2 = \mathcal{O}_{1}(\overline{\mathbf{w}}^{1}) = \bfg_1(\mathcal{O}_{0}(\overline{\mathbf{w}}^{0}), \mathbf{w}_1) = \bfg_1(\mathbf{z}_1, \mathbf{w}_1), \]

and so on. In that forward pass, we also compute \(A_i, B_i\) along the way.

We give the full algorithm now, which involves two passes. In the forward pass, or forward propagation step, we compute the following.

Initialization:

\[\mathbf{z}_0 := \mathbf{x}\]

Forward layer loop: For \(i = 0, 1,\ldots,L\),

\[\begin{align*} \mathbf{z}_{i+1} &:= \bfg_i(\mathbf{z}_i, \mathbf{w}_i)\\ \begin{pmatrix} A_i & B_i \end{pmatrix} &:= J_{\bfg_i}(\mathbf{z}_i, \mathbf{w}_i) \end{align*}\]

Loss:

\[\begin{align*} z_{L+2} &:= \ell(\mathbf{z}_{L+1})\\ \mathbf{p}_{L+1} &:= \nabla {\ell}(\mathbf{z}_{L+1}). \end{align*}\]

In the backward pass, or backpropagation step, we compute the following.

Backward layer loop: For \(i = L,\ldots,1, 0\),

\[\begin{align*} \mathbf{p}_{i} &:= A_i^T \mathbf{p}_{i+1}\\ \mathbf{q}_{i} &:= B_i^T \mathbf{p}_{i+1} \end{align*}\]

Output:

\[ \nabla f(\mathbf{w}) = (\mathbf{q}_0,\mathbf{q}_1,\ldots,\mathbf{q}_L). \]

Note that we do not in fact need to compute \(A_0\) and \(\mathbf{p}_0\).
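To make the two passes concrete, here is a sketch of the algorithm for the running example (linear layers, squared-error loss), checked against autograd on the two-layer instance used throughout. The function name and structure are our own, not from the text.

import torch

def backprop_linear(x, y, Ws):
    # Forward pass: z_{i+1} = W_i z_i, with A_i = W_i and B_i = I ⊗ z_i^T
    zs, As, Bs = [x], [], []
    for W in Ws:
        As.append(W)
        Bs.append(torch.kron(torch.eye(W.shape[0]), zs[-1].unsqueeze(0)))
        zs.append(W @ zs[-1])
    p = zs[-1] - y                       # p_{L+1} = grad ell(y_hat) = y_hat - y
    # Backward pass: q_i = B_i^T p_{i+1}, then p_i = A_i^T p_{i+1}
    qs = []
    for A, B in zip(reversed(As), reversed(Bs)):
        qs.append(B.T @ p)
        p = A.T @ p                      # (p_0 is computed too, though not needed)
    return list(reversed(qs))            # (q_0, q_1, ..., q_L)

# compare with autograd on the two-layer example
x = torch.tensor([1., 0., -1.])
y = torch.tensor([0., 1.])
W0 = torch.tensor([[0., 1., -1.], [2., 0., 1.]], requires_grad=True)
W1 = torch.tensor([[-1., 0.], [2., -1.]], requires_grad=True)
f = 0.5 * (torch.linalg.vector_norm(y - W1 @ (W0 @ x)) ** 2)
f.backward()

q0, q1 = backprop_linear(x, y, [W0.detach(), W1.detach()])
print(torch.allclose(q0, W0.grad.flatten()))   # True
print(torch.allclose(q1, W1.grad.flatten()))   # True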

EXAMPLE: (continued) We apply the algorithm to our running example. From previous calculations, for \(i = 0, 1,\ldots,L\), the Jacobians are

\[\begin{align*} J_{\bfg_i}(\mathbf{z}_i, \mathbf{w}_i) &= \begin{pmatrix} \mathbb{A}_{n_{i+1}}[\mathbf{w}_i] & \mathbb{B}_{n_{i+1}}[\mathbf{z}_i] \end{pmatrix}\\ &= \begin{pmatrix} \mathcal{W}_i & I_{n_{i+1} \times n_{i+1}} \otimes \mathbf{z}_i^T \end{pmatrix}\\ &=: \begin{pmatrix} A_i & B_i \end{pmatrix} \end{align*}\]

and

\[ J_{\ell}(\hat{\mathbf{y}}) = (\hat{\mathbf{y}} - \mathbf{y})^T. \]

Using the Properties of the Kronecker Product, we obtain

\[ \mathbf{p}_{L} := A_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) = \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \]
\[\begin{align*} \mathbf{q}_{L} &:= B_L^T \,\nabla {\ell}(\bfh(\mathbf{w})) = (I_{n_{L+1} \times n_{L+1}} \otimes \mathbf{z}_L^T)^T (\hat{\mathbf{y}} - \mathbf{y}) = (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathbf{z}_L\\ &= (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathcal{W}_{L-1} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x} \end{align*}\]
\[ \mathbf{p}_{L-1} := A_{L-1}^T \mathbf{p}_{L} = \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \]
\[\begin{align*} \mathbf{q}_{L-1} &:= B_{L-1}^T \mathbf{p}_{L} = (I_{n_{L} \times n_{L}} \otimes \mathbf{z}_{L-1}^T)^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) = \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathbf{z}_{L-1}\\ &= \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathcal{W}_{L-2} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x} \end{align*}\]
\[ \mathbf{p}_{L-2} := A_{L-2}^T \mathbf{p}_{L-1} = \mathcal{W}_{L-2}^T \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \]
\[\begin{align*} \mathbf{q}_{L-2} &:= B_{L-2}^T \mathbf{p}_{L-1} = (I_{n_{L-1} \times n_{L-1}} \otimes \mathbf{z}_{L-2}^T)^T \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) = \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathbf{z}_{L-2}\\ &= \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathcal{W}_{L-3} \cdots \mathcal{W}_{1} \mathcal{W}_{0} \mathbf{x} \end{align*}\]

and so on. Following the pattern, the last step is

\[ \mathbf{p}_1 := \mathcal{W}_{1}^T \cdots \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \]
\[ \mathbf{q}_0 := B_{0}^T \mathbf{p}_{1} = \mathcal{W}_{1}^T \cdots \mathcal{W}_{L-1}^T \mathcal{W}_L^T (\hat{\mathbf{y}} - \mathbf{y}) \otimes \mathbf{x}. \]

These calculations are consistent with the case \(L=1\) that we derived previously (check it!). \(\lhd\)
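As a quick numerical sanity check of this pattern (a sketch with arbitrary random dimensions of our choosing, not from the text), we can compare the closed-form \(\mathbf{q}_i\)'s against autograd for a deeper stack of random matrices.

import torch

torch.manual_seed(0)
dims = [3, 4, 5, 2]                      # n_0 = d, n_1, n_2, n_3 = K (so L = 2)
Ws = [torch.randn(dims[i + 1], dims[i], requires_grad=True)
      for i in range(len(dims) - 1)]
x, y = torch.randn(dims[0]), torch.randn(dims[-1])

zs = [x]
for W in Ws:
    zs.append(W @ zs[-1])
f = 0.5 * torch.sum((y - zs[-1]) ** 2)
f.backward()

p = zs[-1].detach() - y                  # p_{L+1} = y_hat - y
for i in reversed(range(len(Ws))):
    q_i = torch.kron(p, zs[i].detach())  # q_i = p_{i+1} ⊗ z_i
    print(i, torch.allclose(q_i, Ws[i].grad.flatten(), atol=1e-6))
    p = Ws[i].detach().T @ p             # p_i = W_i^T p_{i+1}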

CHAT & LEARN: The efficiency of backpropagation has been key to the success of deep learning. Ask your favorite AI chatbot about the history of backpropagation and its role in the development of modern deep learning. \(\ddagger\)

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 In the backpropagation algorithm, what does the ‘forward pass’ compute?

a) The adjoints \(p_i\) for each layer \(i\).

b) The gradients \(q_i\) for the parameters of each layer \(i\).

c) The function values \(z_i\) and the Jacobians \(A_i, B_i\) for each layer \(i\).

d) The final gradient \(\nabla f(w)\) with respect to all parameters.

2 What is the purpose of the ‘backward pass’ in the backpropagation algorithm?

a) To compute the function values \(z_i\) for each layer \(i\) from the input \(x\).

b) To compute the Jacobians \(A_i, B_i\) for each layer \(i\) using the fundamental recursion.

c) To compute the adjoints \(p_i\) and the gradients \(q_i\) for each layer \(i\) using the fundamental recursion.

d) To compute the final output \(\ell(z_{L+1})\) of the progressive function.

3 What is the computational complexity of the backpropagation algorithm in terms of the number of layers \(L\) and the matrix dimensions \(m\)?

a) \(\mathcal{O}(Lm)\)

b) \(\mathcal{O}(Lm^2)\)

c) \(\mathcal{O}(Lm^2d)\)

d) \(\mathcal{O}(Lm^3d)\)

4 In the context of progressive functions, what is the significance of the matrices \(A_i\) and \(B_i\)?

a) They represent the Jacobians of the layer functions with respect to the inputs and parameters, respectively.

b) They are the intermediate values computed during the forward pass.

c) They are the adjoints used in the backpropagation algorithm.

d) They are the matrices of parameters for each layer.

5 In the context of progressive functions, which of the following best describes the role of the vector \(w_i\)?

a) The input to the \(i\)-th layer.

b) The output of the \(i\)-th layer.

c) The parameters specific to the \(i\)-th layer.

d) The concatenation of parameters from all layers up to \(i\).

Answer for 1: c. Justification: The section presents the forward propagation step, which computes the following: “Initialization: \(z_0 := x\) Forward layer loop: For \(i=0,1,\dots,L\), \(z_{i+1} := g_i(z_i, w_i)\), \((A_i,B_i) := J_{g_i}(z_i, w_i)\) Loss: \(z_{L+2} := \ell(z_{L+1})\), \(p_{L+1} := \nabla \ell(z_{L+1})\).”

Answer for 2: c. Justification: The backward pass is described as follows: “Backward layer loop: For \(i=L,\dots,1,0\), \(p_i := A_i^T p_{i+1}\) \(q_i := B_i^T p_{i+1}\) Output: \(\nabla f(w) = (q_0, q_1, \dots, q_L)\).”

Answer for 3: b. Justification: The text derives that the number of operations in the reverse mode is approximately \(2Lm^2 + 2md\), stating “This is approximately \(2Lm^2 + 2md\) – which can be much smaller than \(2Lm^2d\)!”

Answer for 4: a. Justification: The text defines \(A_i\) and \(B_i\) as the blocks of the Jacobian \(J_{g_i}(z_i, w_i)\) corresponding to the partial derivatives with respect to \(z_i\) and \(w_i\), respectively.

Answer for 5: c. Justification: The text explains: “In the machine learning context, each “layer” \(g_i\) has parameters (in our running example, these were the entries of \(W_i\)) and we seek to optimize with respect to those parameters.”