
\(\newcommand{\bmu}{\boldsymbol{\mu}}\) \(\newcommand{\bSigma}{\boldsymbol{\Sigma}}\) \(\newcommand{\bfbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bflambda}{\boldsymbol{\lambda}}\) \(\newcommand{\bgamma}{\boldsymbol{\gamma}}\) \(\newcommand{\bsigma}{{\boldsymbol{\sigma}}}\) \(\newcommand{\bpi}{\boldsymbol{\pi}}\) \(\newcommand{\btheta}{{\boldsymbol{\theta}}}\) \(\newcommand{\bphi}{\boldsymbol{\phi}}\) \(\newcommand{\balpha}{\boldsymbol{\alpha}}\) \(\newcommand{\blambda}{\boldsymbol{\lambda}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bp}{\mathbf{p}}\) \(\renewcommand{\bx}{\mathbf{x}}\) \(\newcommand{\bX}{\mathbf{X}}\) \(\newcommand{\by}{\mathbf{y}}\) \(\newcommand{\bY}{\mathbf{Y}}\) \(\newcommand{\bz}{\mathbf{z}}\) \(\newcommand{\bZ}{\mathbf{Z}}\) \(\newcommand{\bw}{\mathbf{w}}\) \(\newcommand{\bW}{\mathbf{W}}\) \(\newcommand{\bv}{\mathbf{v}}\) \(\newcommand{\bV}{\mathbf{V}}\) \(\newcommand{\bfg}{\mathbf{g}}\) \(\newcommand{\bfh}{\mathbf{h}}\) \(\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}\) \(\renewcommand{\S}{\mathcal{S}}\) \(\newcommand{\X}{\mathcal{X}}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\pa}{\mathrm{pa}}\) \(\newcommand{\Z}{\mathcal{Z}}\) \(\newcommand{\bh}{\mathbf{h}}\) \(\newcommand{\bb}{\mathbf{b}}\) \(\newcommand{\bc}{\mathbf{c}}\) \(\newcommand{\cE}{\mathcal{E}}\) \(\newcommand{\cP}{\mathcal{P}}\) \(\newcommand{\bbeta}{\boldsymbol{\beta}}\) \(\newcommand{\bLambda}{\boldsymbol{\Lambda}}\) \(\newcommand{\cov}{\mathrm{Cov}}\) \(\newcommand{\bfk}{\mathbf{k}}\) \(\newcommand{\idx}[1]{}\) \(\newcommand{\xdi}{}\)

8.4. Building blocks of AI 2: stochastic gradient descent

Having shown how to compute the gradient, we can now apply gradient descent to fit the data.

To get the full gradient, we now consider \(n\) samples \((\mathbf{x}_i,y_i)\), \(i=1,\ldots,n\). At this point, we make the dependence on \((\mathbf{x}_i, y_i)\) explicit. The loss function can be taken as the average of the individual sample contributions, so the gradient is obtained by linearity

\[ \nabla \left(\frac{1}{n} \sum_{i=1}^n f_{\mathbf{x}_i,y_i}(\mathbf{w})\right) = \frac{1}{n} \sum_{i=1}^n \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w}), \]

where each term can be computed separately by the procedure above.

We can then apply gradient descent. We start from an arbitrary \(\mathbf{w}^{0}\) and update as follows

\[ \mathbf{w}^{t+1} = \mathbf{w}^{t} - \alpha_t \left(\frac{1}{n} \sum_{i=1}^n \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w}^{t})\right). \]

In a large dataset, computing the sum over all samples may be prohibitively expensive. We present a popular alternative.

8.4.1. Algorithm

In stochastic gradient descent (SGD)\(\idx{stochastic gradient descent}\xdi\), a variant of gradient descent, we pick a sample \(I_t\) uniformly at random in \(\{1,\ldots,n\}\) and update as follows

\[ \mathbf{w}^{t+1} = \mathbf{w}^{t} - \alpha_t \nabla f_{\mathbf{x}_{I_t},y_{I_t}}(\mathbf{w}^{t}). \]

More generally, in the so-called mini-batch version of SGD, we pick instead a uniformly random sub-sample \(\mathcal{B}_t \subseteq \{1,\ldots,n\}\) of size \(b\) without replacement (i.e., all sub-samples of that size have the same probability of being picked) and update as follows

\[ \mathbf{w}^{t+1} = \mathbf{w}^{t} - \alpha_t \frac{1}{b} \sum_{i\in \mathcal{B}_t} \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w}^{t}). \]

The key observation about the two stochastic updates above is that, in expectation, they perform a step of gradient descent. That turns out to be enough and it has computational advantages.

LEMMA Fix a batch size \(1 \leq b \leq n\) and an arbitrary vector of parameters \(\mathbf{w}\). Let \(\mathcal{B} \subseteq \{1,\ldots,n\}\) be a uniformly random sub-sample of size \(b\). Then

\[ \mathbb{E}\left[\frac{1}{b} \sum_{i\in \mathcal{B}} \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w})\right] = \frac{1}{n} \sum_{i=1}^n \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w}). \]

\(\flat\)

Proof: Because \(\mathcal{B}\) is picked uniformly at random (without replacement), for any sub-sample \(B \subseteq \{1,\ldots,n\}\) of size \(b\) without repeats

\[ \mathbb{P}[\mathcal{B} = B] = \frac{1}{\binom{n}{b}}. \]

So, summing over all such sub-samples, we obtain

\[\begin{align*} \mathbb{E}\left[\frac{1}{b} \sum_{i\in \mathcal{B}} \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w})\right] &= \frac{1}{b} \sum_{\substack{B \subseteq \{1,\ldots,n\}\\ |B| = b}} \mathbb{P}[\mathcal{B} = B] \sum_{i\in B} \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w})\\ &= \frac{1}{b} \sum_{\substack{B \subseteq \{1,\ldots,n\}\\ |B| = b}} \frac{1}{\binom{n}{b}} \sum_{i=1}^n \mathbf{1}\{i \in B\} \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w})\\ &= \sum_{i=1}^n \nabla f_{\mathbf{x}_i,y_i}(\mathbf{w}) \frac{1}{b \binom{n}{b}} \sum_{\substack{B \subseteq \{1,\ldots,n\}\\ |B| = b}} \mathbf{1}\{i \in B\}. \end{align*}\]

Computing the internal sum requires a combinatorial argument. Indeed, \(\sum_{B} \mathbf{1}\{i \in B\}\), where the sum is over sub-samples \(B \subseteq \{1,\ldots,n\}\) of size \(b\), counts the number of such sub-samples that contain \(i\). That count is \(\binom{n-1}{b-1}\), the number of ways of picking the remaining \(b-1\) elements of \(B\) from the other \(n-1\) possible elements. By definition of the binomial coefficient and the properties of factorials,

\[ \frac{\binom{n-1}{b-1}}{b \binom{n}{b}} = \frac{\frac{(n-1)!}{(b-1)! (n-b)!}}{b \frac{n!}{b! (n-b)!}} = \frac{(n-1)!}{n!} \frac{b!}{b (b-1)!} = \frac{1}{n}. \]

Plugging back gives the claim. \(\square\)
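
As a quick sanity check of both the lemma and the binomial identity above, the following sketch (illustrative code, not from the text; it assumes numpy is available) averages the mini-batch estimator over all size-\(b\) sub-samples for a toy least-squares loss and compares the result to the full gradient.

# Illustrative check of the lemma: for a toy least-squares loss, averaging the
# mini-batch gradient estimator over all size-b sub-samples (i.e., taking its
# expectation under the uniform distribution) recovers the full gradient.
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n, d, b_size = 6, 3, 2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def grad_i(i, w):  # gradient of the i-th sample loss 0.5 * (x_i^T w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

full_grad = np.mean([grad_i(i, w) for i in range(n)], axis=0)
subsets = list(itertools.combinations(range(n), b_size))
avg_estimator = np.mean(
    [np.mean([grad_i(i, w) for i in B], axis=0) for B in subsets], axis=0)

print(np.allclose(full_grad, avg_estimator))  # expected: True
# The binomial identity from the proof: both sides equal 1/n.
print(math.comb(n - 1, b_size - 1) / (b_size * math.comb(n, b_size)), 1 / n)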

As a first illustration, we return to logistic regression\(\idx{logistic regression}\xdi\). Recall that the input data is of the form \(\{(\boldsymbol{\alpha}_i, b_i) : i=1,\ldots, n\}\) where \(\boldsymbol{\alpha}_i = (\alpha_{i,1}, \ldots, \alpha_{i,d}) \in \mathbb{R}^d\) are the features and \(b_i \in \{0,1\}\) is the label. As before we use a matrix representation: \(A \in \mathbb{R}^{n \times d}\) has rows \(\boldsymbol{\alpha}_i^T\), \(i = 1,\ldots, n\) and \(\mathbf{b} = (b_1, \ldots, b_n) \in \{0,1\}^n\). We want to solve the minimization problem

\[ \min_{\mathbf{x} \in \mathbb{R}^d} \ell(\mathbf{x}; A, \mathbf{b}) \]

where the loss is

\[\begin{align*} \ell(\mathbf{x}; A, \mathbf{b}) &= \frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha_i}^T \mathbf{x})) - (1-b_i) \log(1- \sigma(\boldsymbol{\alpha_i}^T \mathbf{x}))\right\}\\ &= \mathrm{mean}\left(-\mathbf{b} \odot \mathbf{log}(\bsigma(A \mathbf{x})) - (\mathbf{1} - \mathbf{b}) \odot \mathbf{log}(\mathbf{1} - \bsigma(A \mathbf{x}))\right). \end{align*}\]

The gradient was previously computed as

\[\begin{align*} \nabla\ell(\mathbf{x}; A, \mathbf{b}) &= - \frac{1}{n} \sum_{i=1}^n ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) ) \,\boldsymbol{\alpha}_i\\ &= -\frac{1}{n} A^T [\mathbf{b} - \bsigma(A \mathbf{x})]. \end{align*}\]

For the mini-batch version of SGD, we pick a random sub-sample \(\mathcal{B}_t \subseteq \{1,\ldots,n\}\) of size \(B\) and take the step

\[ \mathbf{x}^{t+1} = \mathbf{x}^{t} +\beta \frac{1}{B} \sum_{i\in \mathcal{B}_t} ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^t) ) \,\boldsymbol{\alpha}_i. \]

We modify our previous code for logistic regression. The only change is to pick a random mini-batch, which is then fed to the descent update sub-routine as the dataset. (For simplicity, the code below draws the mini-batch indices with replacement via rng.integers; sampling without replacement, as described above, could be done with rng.choice with replace=False.)

def sigmoid(z): 
    return 1/(1+np.exp(-z))

def pred_fn(x, A): 
    return sigmoid(A @ x)

def loss_fn(x, A, b): 
    return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))

def grad_fn(x, A, b):
    return -A.T @ (b - pred_fn(x, A))/len(b)

def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
    gradient = grad_fn(curr_x, A, b)
    return curr_x - beta*gradient

def sgd_for_logreg(rng, loss_fn, grad_fn, A, b, 
                   init_x, beta=1e-3, niters=int(1e5), batch=40):
    
    curr_x = init_x
    nsamples = len(b)
    for _ in range(niters):
        I = rng.integers(nsamples, size=batch)  # mini-batch indices (drawn with replacement)
        curr_x = desc_update_for_logreg(
            grad_fn, A[I,:], b[I], curr_x, beta)
    
    return curr_x
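
Before applying this to data, one can sanity-check grad_fn against a centered finite-difference approximation of loss_fn on made-up inputs (an illustrative test, not part of the original code).

# Illustrative check: compare grad_fn to a centered finite-difference
# approximation of loss_fn on random made-up data.
rng_check = np.random.default_rng(0)
A_check = rng_check.normal(size=(20, 3))
b_check = rng_check.integers(2, size=20).astype(float)
x_check = rng_check.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (loss_fn(x_check + eps * e, A_check, b_check)
     - loss_fn(x_check - eps * e, A_check, b_check)) / (2 * eps)
    for e in np.eye(3)])
print(np.allclose(num_grad, grad_fn(x_check, A_check, b_check), atol=1e-5))  # expected: True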

NUMERICAL CORNER: We analyze a dataset from [ESL], which can be downloaded here. Quoting [ESL, Section 4.4.2]

The data […] are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).

We load the data, which we have slightly reformatted, and look at a summary.

data = pd.read_csv('SAHeart.csv')
data.head()
sbp tobacco ldl adiposity typea obesity alcohol age chd
0 160.0 12.00 5.73 23.11 49.0 25.30 97.20 52.0 1.0
1 144.0 0.01 4.41 28.61 55.0 28.87 2.06 63.0 1.0
2 118.0 0.08 3.48 32.28 52.0 29.14 3.81 46.0 0.0
3 170.0 7.50 6.41 38.03 51.0 31.99 24.26 58.0 1.0
4 134.0 13.60 3.50 27.78 60.0 25.99 57.34 49.0 1.0

Our goal is to predict chd, which stands for coronary heart disease, based on the other variables (which are briefly described here). We use logistic regression again.

We first construct the data matrices. We only use three of the predictors.

feature = data[['tobacco', 'ldl', 'age']].to_numpy()
print(feature)
[[1.200e+01 5.730e+00 5.200e+01]
 [1.000e-02 4.410e+00 6.300e+01]
 [8.000e-02 3.480e+00 4.600e+01]
 ...
 [3.000e+00 1.590e+00 5.500e+01]
 [5.400e+00 1.161e+01 4.000e+01]
 [0.000e+00 4.820e+00 4.600e+01]]
label = data['chd'].to_numpy()
A = np.concatenate((np.ones((len(label),1)),feature),axis=1)
b = label

We try mini-batch SGD.

seed = 535
rng = np.random.default_rng(seed)
init_x = np.zeros(A.shape[1])
best_x = sgd_for_logreg(rng, loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e6))
print(best_x)
[-4.06558071  0.07990955  0.18813635  0.04693118]

The outcome is harder to visualize. To get a sense of how accurate the result is, we compare our predictions to the true labels. By prediction, we mean that we predict label \(1\) whenever \(\sigma(\boldsymbol{\alpha}^T \mathbf{x}) > 1/2\). We try this on the training set. (A better approach would be to split the data into training and testing sets, but we will not do this here.)

def logis_acc(x, A, b):
    return np.sum((pred_fn(x, A) > 0.5) == b)/len(b)
logis_acc(best_x, A, b)
0.7207792207792207

\(\unlhd\)

8.4.2. Example: multinomial logistic regression

We give a concrete example of progressive functions and of the application of backpropagation and SGD.

Recall that a classifier \(h\) takes an input in \(\mathbb{R}^d\) and predicts one of \(K\) possible labels. It will be convenient, for reasons that will become clear below, to use a one-hot encoding\(\idx{one-hot encoding}\xdi\) of the labels. That is, we encode label \(i\) as the \(K\)-dimensional vector \(\mathbf{e}_i\). Here, as usual, \(\mathbf{e}_i\) is the \(i\)-th vector of the canonical basis of \(\mathbb{R}^K\), i.e., the vector with a \(1\) in entry \(i\) and a \(0\) elsewhere. Furthermore, we allow the output of the classifier to be a probability distribution over the labels \(\{1,\ldots,K\}\), that is, a vector in

\[ \Delta_K = \left\{ (p_1,\ldots,p_K) \in [0,1]^K \,:\, \sum_{k=1}^K p_k = 1 \right\}. \]

Background on multinomial logistic regression We turn to multinomial logistic regression\(\idx{multinomial logistic regression}\xdi\) to learn a classifier over \(K\) labels. Recall that we encode label \(i\) as the \(K\)-dimensional vector \(\mathbf{e}_i\). We allow the output of the classifier to be a probability distribution over the labels \(\{1,\ldots,K\}\). Observe that \(\mathbf{e}_i\) can itself be thought of as a probability distribution, one that assigns probability one to \(i\).

In multinomial logistic regression, we once again use an affine function of the input data.

This time, we have \(K\) functions that output a score associated to each label. We then transform these scores into a probability distribution over the \(K\) labels. There are many ways of doing this. A standard approach is the softmax function\(\idx{softmax}\xdi\) \(\bgamma = (\gamma_1,\ldots,\gamma_K)\): for \(\mathbf{z} \in \mathbb{R}^K\)

\[ \gamma_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}, \quad i=1,\ldots,K. \]

To explain the name, observe that larger inputs are mapped to larger probabilities, so the softmax acts as a smooth ("soft") version of picking out the largest entry.
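
In code, a standard trick is to subtract the largest entry before exponentiating; this leaves the output unchanged (the softmax is invariant to adding a constant to every entry) while avoiding overflow. A minimal numpy sketch, not from the text:

# Numerically stable softmax: subtracting max(z) from every entry does not
# change the output but prevents overflow in the exponentials.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))            # larger scores -> larger probabilities
print(softmax(np.array([1001.0, 1002.0, 1003.0])))   # same output, no overflow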

In fact, since a probability distribution must sum to \(1\), it is determined by the probabilities assigned to the first \(K-1\) labels. In other words, we could drop the score associated to the last label. But to keep the notation simple, we will not do this here.

For each \(k\), we have a regression function

\[ \sum_{j=1}^d w^{(k)}_{j} x_{j} = \mathbf{x}^T \mathbf{w}^{(k)}, \quad k=1,\ldots,K \]

where \(\mathbf{w} = (\mathbf{w}^{(1)},\ldots,\mathbf{w}^{(K)})\) are the parameters with \(\mathbf{w}^{(k)} \in \mathbb{R}^{d}\) and \(\mathbf{x} \in \mathbb{R}^d\) is the input. A constant term can be included by adding an additional entry \(1\) to \(\mathbf{x}\). As we did in the linear regression case, we assume that this pre-processing has been performed previously. To simplify the notation, we let \(\mathcal{W} \in \mathbb{R}^{K \times d}\) be the matrix with rows \((\mathbf{w}^{(1)})^T,\ldots,(\mathbf{w}^{(K)})^T\).

The output of the classifier is

\[\begin{align*} \bfh(\mathbf{w}) &= \bgamma\left(\mathcal{W} \mathbf{x}\right), \end{align*}\]

where \(\bgamma\) is the softmax function. Note that the latter has no associated parameter.

It remains to define a loss function. To quantify the fit, it is natural to use a notion of distance between probability measures, here between the output \(\mathbf{h}_{\mathbf{x}}(\mathbf{w}) \in \Delta_K\) and the correct label \(\mathbf{y} \in \{\mathbf{e}_1,\ldots,\mathbf{e}_{K}\} \subseteq \Delta_K\). There are many such measures. In multinomial logistic regression, we use the Kullback-Leibler divergence, which we have encountered in the context of maximum likelihood estimation. Recall that, for two probability distributions \(\mathbf{p}, \mathbf{q} \in \Delta_K\), it is defined as

\[ \mathrm{KL}(\mathbf{p} \| \mathbf{q}) = \sum_{i=1}^K p_i \log \frac{p_i}{q_i} \]

where it will suffice to restrict ourselves to the case \(\mathbf{q} > \mathbf{0}\) and where we use the convention \(0 \log 0 = 0\) (so that terms with \(p_i = 0\) contribute \(0\) to the sum). Notice that \(\mathbf{p} = \mathbf{q}\) implies \(\mathrm{KL}(\mathbf{p} \| \mathbf{q}) = 0\). We proved previously that \(\mathrm{KL}(\mathbf{p} \| \mathbf{q}) \geq 0\), a result known as Gibbs’ inequality.
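
As an illustration (with made-up distributions), the definition can be coded directly, using the convention that terms with \(p_i = 0\) contribute \(0\):

# Kullback-Leibler divergence with the convention 0 log 0 = 0.
import numpy as np

def kl_div(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # skip terms with p_i = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([1.0, 0.0, 0.0])          # one-hot vector, viewed as a distribution
q = np.array([0.7, 0.2, 0.1])
print(kl_div(p, q), kl_div(q, q))      # first value is -log(0.7), second is 0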

Going back to the loss function, we use the identity \(\log\frac{\alpha}{\beta} = \log \alpha - \log \beta\) to re-write

\[\begin{align*} \mathrm{KL}(\mathbf{y} \| \bfh(\mathbf{w})) &= \sum_{i=1}^K y_i \log \frac{y_i}{h_{i}(\mathbf{w})}\\ &= \sum_{i=1}^K y_i \log y_i - \sum_{i=1}^K y_i \log h_{i}(\mathbf{w}), \end{align*}\]

where \(\bfh = (h_{1},\ldots,h_{K})\). Notice that the first term on the right-hand side does not depend on \(\mathbf{w}\). Hence we can ignore it when optimizing \(\mathrm{KL}(\mathbf{y} \| \bfh(\mathbf{w}))\). The remaining term is

\[ H(\mathbf{y}, \bfh(\mathbf{w})) = - \sum_{i=1}^K y_i \log h_{i}(\mathbf{w}). \]

We use it to define our loss function. That is, we set

\[ \ell(\hat{\mathbf{y}}) = H(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{i=1}^K y_i \log \hat{y}_{i}. \]

Finally,

\[\begin{align*} f(\mathbf{w}) &= \ell(\bfh(\mathbf{w}))\\ &= H(\mathbf{y}, \bfh(\mathbf{w}))\\ &= H\left(\mathbf{y}, \bgamma\left(\mathcal{W} \mathbf{x}\right)\right)\\ &= - \sum_{i=1}^K y_i \log\gamma_i\left(\mathcal{W} \mathbf{x}\right). \end{align*}\]
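
For a one-hot label \(\mathbf{y} = \mathbf{e}_k\), the discarded term \(\sum_i y_i \log y_i\) is zero, so the cross-entropy and the KL divergence coincide and the loss reduces to \(-\log \gamma_k(\mathcal{W}\mathbf{x})\). The sketch below (hypothetical \(\mathcal{W}\) and \(\mathbf{x}\), reusing the softmax and kl_div helpers defined above) illustrates this.

# Cross-entropy loss f(w) = H(y, softmax(W x)) for a one-hot label y = e_k;
# for one-hot y it coincides with KL(y || softmax(W x)).
rng_ce = np.random.default_rng(1)
K, d = 3, 4
W = rng_ce.normal(size=(K, d))
x = rng_ce.normal(size=d)
y = np.zeros(K); y[2] = 1.0            # label 3 encoded as e_3

h = softmax(W @ x)
f = -np.sum(y * np.log(h))             # equals -log h[2]
print(f, kl_div(y, h))                 # the two values agree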

Computing the gradient We apply the forward and backpropagation steps from the previous section. We then use the resulting recursions to derive an analytical formula for the gradient.

The forward pass starts with the initialization \(\mathbf{z}_0 := \mathbf{x}\). The forward layer loop has two steps. Set \(\mathbf{w}_0 = (\mathbf{w}_0^{(1)},\ldots,\mathbf{w}_0^{(K)})\) equal to \(\mathbf{w} = (\mathbf{w}^{(1)},\ldots,\mathbf{w}^{(K)})\). First we compute

\[\begin{align*} \mathbf{z}_{1} &:= \bfg_0(\mathbf{z}_0,\mathbf{w}_0) = \mathcal{W}_0 \mathbf{z}_0\\ J_{\bfg_0}(\mathbf{z}_0,\mathbf{w}_0) &:=\begin{pmatrix} A_0 & B_0 \end{pmatrix} \end{align*}\]

where we defined \(\mathcal{W}_0 \in \mathbb{R}^{K \times d}\) as the matrix with rows \((\mathbf{w}_0^{(1)})^T,\ldots,(\mathbf{w}_0^{(K)})^T\). We have previously computed the Jacobian:

\[\begin{split} A_0 = \mathbb{A}_{K}[\mathbf{w}_0] = \mathcal{W}_0 = \begin{pmatrix} (\mathbf{w}^{(1)}_0)^T\\ \vdots\\ (\mathbf{w}^{(K)}_0)^T \end{pmatrix} \end{split}\]

and

\[ B_0 = \mathbb{B}_{K}[\mathbf{z}_0] = I_{K\times K} \otimes \mathbf{z}_0^T = \begin{pmatrix} \mathbf{e}_1 \mathbf{z}_0^T & \cdots & \mathbf{e}_{K}\mathbf{z}_0^T \end{pmatrix}. \]
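
As a quick check of this formula (illustrative code, not from the text), one can compare \(I_{K\times K} \otimes \mathbf{z}_0^T\) to a finite-difference approximation of the Jacobian of \(\mathbf{w}_0 \mapsto \mathcal{W}_0 \mathbf{z}_0\), where \(\mathbf{w}_0\) is the row-major flattening of \(\mathcal{W}_0\):

# Check that the Jacobian of w -> W z0, with w the row-major flattening of W,
# is I_K (Kronecker product) z0^T.
import numpy as np

rng_j = np.random.default_rng(2)
K, d = 3, 4
z0 = rng_j.normal(size=d)
w = rng_j.normal(size=K * d)           # flattened parameter vector

def g(w_vec):
    return w_vec.reshape(K, d) @ z0    # W z0 with W rebuilt from w_vec

eps = 1e-6
J_num = np.column_stack([
    (g(w + eps * e) - g(w - eps * e)) / (2 * eps) for e in np.eye(K * d)])
print(np.allclose(J_num, np.kron(np.eye(K), z0)))   # expected: True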

In the second step of the forward layer loop, we compute

\[\begin{align*} \hat{\mathbf{y}} := \mathbf{z}_2 &:= \bfg_1(\mathbf{z}_1) = \bgamma(\mathbf{z}_1)\\ A_1 &:= J_{\bfg_1}(\mathbf{z}_1) = J_{\bgamma}(\mathbf{z}_1). \end{align*}\]

So we need to compute the Jacobian of \(\bgamma\). We divide this computation into two cases. When \(1 \leq i = j \leq K\),

\[\begin{align*} (A_1)_{ii} &= \frac{\partial}{\partial z_{1,i}} \left[ \gamma_i(\mathbf{z}_1) \right]\\ &= \frac{\partial}{\partial z_{1,i}} \left[ \frac{e^{z_{1,i}}}{\sum_{k=1}^{K} e^{z_{1,k}}} \right]\\ &= \frac{e^{z_{1,i}}\left(\sum_{k=1}^{K} e^{z_{1,k}}\right) - e^{z_{1,i}}\left(e^{z_{1,i}}\right)} {\left(\sum_{k=1}^{K} e^{z_{1,k}}\right)^2}\\ &= \gamma_i(\mathbf{z}_1) - \gamma_i(\mathbf{z}_1)^2, \end{align*}\]

by the quotient rule.

When \(1 \leq i, j \leq K\) with \(i \neq j\),

\[\begin{align*} (A_1)_{ij} &= \frac{\partial}{\partial z_{1,j}} \left[ \gamma_i(\mathbf{z}_1) \right]\\ &= \frac{\partial}{\partial z_{1,j}} \left[ \frac{e^{z_{1,i}}}{\sum_{k=1}^{K} e^{z_{1,k}}} \right]\\ &= \frac{- e^{z_{1,i}}\left(e^{z_{1,j}}\right)} {\left(\sum_{k=1}^{K} e^{z_{1,k}}\right)^2}\\ &= - \gamma_i(\mathbf{z}_1)\gamma_j(\mathbf{z}_1). \end{align*}\]

In matrix form,

\[ A_1 = \mathrm{diag}(\bgamma(\mathbf{z}_1)) - \bgamma(\mathbf{z}_1) \, \bgamma(\mathbf{z}_1)^T. \]
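
This formula can also be verified numerically against finite differences (illustrative code, reusing the softmax helper from the earlier sketch):

# Check the softmax Jacobian formula diag(gamma(z)) - gamma(z) gamma(z)^T
# against a centered finite-difference approximation.
rng_s = np.random.default_rng(3)
K = 5
z = rng_s.normal(size=K)

eps = 1e-6
J_num = np.column_stack([
    (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps) for e in np.eye(K)])
gam = softmax(z)
print(np.allclose(J_num, np.diag(gam) - np.outer(gam, gam), atol=1e-6))  # expected: True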

The Jacobian of the loss function is

\[ J_{\ell}(\hat{\mathbf{y}}) = \nabla \left[ - \sum_{i=1}^K y_i \log \hat{y}_{i} \right]^T = -\left(\frac{y_1}{\hat{y}_{1}}, \ldots, \frac{y_K}{\hat{y}_{K}}\right)^T = - (\mathbf{y}\oslash\hat{\mathbf{y}})^T, \]

where recall that \(\oslash\) is the Hadamard division (i.e., element-wise division).

We summarize the whole procedure next.

Initialization:

\[\mathbf{z}_0 := \mathbf{x}\]

Forward layer loop:

\[\begin{align*} \mathbf{z}_{1} &:= \bfg_0(\mathbf{z}_0, \mathbf{w}_0) = \mathcal{W}_0 \mathbf{z}_0\\ \begin{pmatrix} A_0 & B_0 \end{pmatrix} &:= J_{\bfg_0}(\mathbf{z}_0,\mathbf{w}_0) = \begin{pmatrix} \mathbb{A}_{K}[\mathbf{w}_0] & \mathbb{B}_{K}[\mathbf{z}_0] \end{pmatrix} \end{align*}\]
\[\begin{align*} \hat{\mathbf{y}} := \mathbf{z}_2 &:= \bfg_1(\mathbf{z}_1) = \bgamma(\mathbf{z}_1)\\ A_1 &:= J_{\bfg_1}(\mathbf{z}_1) = \mathrm{diag}(\bgamma(\mathbf{z}_1)) - \bgamma(\mathbf{z}_1) \, \bgamma(\mathbf{z}_1)^T \end{align*}\]

Loss:

\[\begin{align*} z_3 &:= \ell(\mathbf{z}_2) = - \sum_{i=1}^K y_i \log z_{2,i}\\ \mathbf{p}_2 &:= \nabla {\ell_{\mathbf{y}}}(\mathbf{z}_2) = -\left(\frac{y_1}{z_{2,1}}, \ldots, \frac{y_K}{z_{2,K}}\right) = - \mathbf{y} \oslash \mathbf{z}_2. \end{align*}\]

Backward layer loop:

\[\begin{align*} \mathbf{p}_{1} &:= A_1^T \mathbf{p}_{2} \end{align*}\]
\[\begin{align*} \mathbf{q}_{0} &:= B_0^T \mathbf{p}_{1} \end{align*}\]

Output:

\[ \nabla f(\mathbf{w}) = \mathbf{q}_0, \]

where recall that \(\mathbf{w} := \mathbf{w}_0\).

Explicit formulas can be derived from the previous recursion.

We first compute \(\mathbf{p}_1\). We use the Properties of the Hadamard Product. We get

\[\begin{align*} \mathbf{p}_1 &= A_1^T \mathbf{p}_{2}\\ &= [\mathrm{diag}(\bgamma(\mathbf{z}_1)) - \bgamma(\mathbf{z}_1) \, \bgamma(\mathbf{z}_1)^T]^T [- \mathbf{y} \oslash \bgamma(\mathbf{z}_1)]\\ &= - \mathrm{diag}(\bgamma(\mathbf{z}_1)) \, (\mathbf{y} \oslash \bgamma(\mathbf{z}_1)) + \bgamma(\mathbf{z}_1) \, \bgamma(\mathbf{z}_1)^T \, (\mathbf{y} \oslash \bgamma(\mathbf{z}_1))\\ &= - \mathbf{y} + \bgamma(\mathbf{z}_1) \, \mathbf{1}^T\mathbf{y}\\ &= \bgamma(\mathbf{z}_1) - \mathbf{y}, \end{align*}\]

where we used that \(\sum_{k=1}^{K} y_k = 1\).

It remains to compute \(\mathbf{q}_0\). We have by parts (e) and (f) of the Properties of the Kronecker Product

\[\begin{align*} \mathbf{q}_{0} = B_0^T \mathbf{p}_{1} &= (I_{K\times K} \otimes \mathbf{z}_0^T)^T (\bgamma(\mathbf{z}_1) - \mathbf{y})\\ &= ( I_{K\times K} \otimes \mathbf{z}_0)\, (\bgamma(\mathbf{z}_1) - \mathbf{y})\\ &= (\bgamma(\mathbf{z}_1) - \mathbf{y}) \otimes \mathbf{z}_0. \end{align*}\]

Finally, replacing \(\mathbf{z}_0 = \mathbf{x}\) and \(\mathbf{z}_1 = \mathcal{W} \mathbf{x}\), the gradient is

\[ \nabla f(\mathbf{w}) = \mathbf{q}_0 = (\bgamma\left(\mathcal{W} \mathbf{x}\right) - \mathbf{y}) \otimes \mathbf{x}. \]
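
As with the intermediate Jacobians, this closed-form gradient can be checked against a finite-difference approximation of \(f\) (illustrative code, reusing the softmax helper; \(\mathbf{w}\) is again the row-major flattening of \(\mathcal{W}\)):

# Check the gradient formula (softmax(W x) - y) (Kronecker product) x against
# a centered finite-difference approximation of f(w) = -sum_i y_i log softmax(W x)_i.
rng_g = np.random.default_rng(4)
K, d = 3, 4
x = rng_g.normal(size=d)
y = np.zeros(K); y[1] = 1.0            # one-hot label
w = rng_g.normal(size=K * d)           # flattened parameter vector

def f(w_vec):
    h = softmax(w_vec.reshape(K, d) @ x)
    return -np.sum(y * np.log(h))

eps = 1e-6
grad_num = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(K * d)])
grad_formula = np.kron(softmax(w.reshape(K, d) @ x) - y, x)
print(np.allclose(grad_num, grad_formula, atol=1e-6))   # expected: True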

It can be shown that the objective function \(f(\mathbf{w})\) is convex in \(\mathbf{w}\).

NUMERICAL CORNER: We will use the Fashion MNIST dataset. This example is inspired by these tutorials. We first check for the availability of GPUs and load the data.

device = torch.device("cuda" if torch.cuda.is_available() 
                      else ("mps" if torch.backends.mps.is_available() 
                            else "cpu"))
print("Using device:", device)
Using device: mps
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim

seed = 42
torch.manual_seed(seed)

if device.type == 'cuda': # device-specific seeding and settings
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # for multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
elif device.type == 'mps':
    torch.mps.manual_seed(seed)  # MPS-specific seeding

g = torch.Generator()
g.manual_seed(seed)

train_dataset = datasets.FashionMNIST(root='./data', train=True, 
                               download=True, transform=transforms.ToTensor())
test_dataset = datasets.FashionMNIST(root='./data', train=False, 
                              download=True, transform=transforms.ToTensor())

BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, generator=g)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

We use torch.utils.data.DataLoader, which provides utilities to load the data in batches for training. We take mini-batches of size BATCH_SIZE = 32 and apply a random permutation of the samples on every pass over the training data (with the option shuffle=True). The function torch.manual_seed() is used to set the global seed for PyTorch operations (e.g., weight initialization). The shuffling in DataLoader uses its own separate random number generator, which we initialize with torch.Generator() and manual_seed(). (You can tell from the fact that seed=42 that Claude explained that one to me…)

CHAT & LEARN Ask your favorite AI chatbot to explain the lines:

if device.type == 'cuda': # device-specific seeding and settings
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # for multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
elif device.type == 'mps':
    torch.mps.manual_seed(seed)  # MPS-specific seeding

\(\ddagger\)

We implement multinomial logistic regression to learn a classifier for the Fashion-MNIST data. In PyTorch, composition of functions can be achieved with torch.nn.Sequential. Our model is:

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 10)
).to(device)

The torch.nn.Flatten layer turns each input image into a vector of size \(784\) (where \(784 = 28^2\) is the number of pixels in each image). After the flattening, we have an affine map from \(\mathbb{R}^{784}\) to \(\mathbb{R}^{10}\). Note that there is no need to pre-process the inputs by adding \(1\)s. A constant term (or “bias variable”) is automatically added by PyTorch (unless one chooses the option bias=False). The final output is \(10\)-dimensional.

Finally, we are ready to run an optimization method of our choice on the loss function; both are specified next. There are many optimizers available. (See this post for a brief explanation of many common optimizers.) Here we use SGD as the optimizer. A quick tutorial is here. The loss function is the cross-entropy, as implemented by torch.nn.CrossEntropyLoss, which applies the softmax itself and expects the labels to be class indices rather than their one-hot encodings.

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
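
To see concretely what torch.nn.CrossEntropyLoss computes, the following illustrative check (with made-up scores and labels) compares it to taking minus the log-softmax of the scores at the true class indices and averaging:

# Illustrative check: CrossEntropyLoss(logits, labels) equals the average of
# -log softmax(logits) evaluated at the true class indices.
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # 4 made-up score vectors, 10 classes
labels = torch.tensor([3, 0, 9, 1])          # class indices, not one-hot vectors
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), labels].mean()
print(torch.allclose(loss_fn(logits, labels), manual))   # expected: True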

We implement special functions for training.

def train(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    model.train()                            # set the model to training mode
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        pred = model(X)                      # forward pass on the mini-batch
        loss = loss_fn(pred, y)              # evaluate the loss
        optimizer.zero_grad()                # reset the gradients (they accumulate by default)
        loss.backward()                      # backpropagation: compute the gradient
        optimizer.step()                     # descent update

def training_loop(train_loader, model, loss_fn, optimizer, device, epochs=3):
    for epoch in range(epochs):
        train(train_loader, model, loss_fn, optimizer, device)
        print(f"Epoch {epoch+1}/{epochs}")

An epoch is one pass through the training data in which every sample is visited once (in a randomly shuffled order). In the interest of time, we train for 10 epochs only, but the model does better if you train it longer (try it!). On each mini-batch, we compute the output of the current model, use backward() to obtain the gradient, and then perform a descent update with step(). We also have to reset the gradients first with zero_grad() (otherwise they accumulate by default).

training_loop(train_loader, model, loss_fn, optimizer, device, epochs=10)

Because of the issue of overfitting, we use the test images to assess the performance of the final classifier.

def test(dataloader, model, loss_fn, device):
    size = len(dataloader.dataset)
    correct = 0    
    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            correct += (pred.argmax(dim=1) == y).type(torch.float).sum().item()

    print(f"Test error: {(100*(correct / size)):>0.1f}% accuracy")
test(test_loader, model, loss_fn, device)
Test error: 78.7% accuracy

To make a prediction, we apply torch.nn.functional.softmax to the output of our model. Recall that it is implicitly included in torch.nn.CrossEntropyLoss, but is not actually part of model. (Note that the softmax itself has no parameter.)

As an illustration, we do this for each test image. We use torch.cat to concatenate a sequence of tensors into a single tensor.

import torch.nn.functional as F

def predict_softmax(dataloader, model, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    predictions = []
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            probabilities = F.softmax(pred, dim=1)
            predictions.append(probabilities.cpu())
            
    return torch.cat(predictions, dim=0)

predictions = predict_softmax(test_loader, model, device).numpy()

The result for the first test image is shown below. To make a prediction, we choose the label with the highest probability.

print(predictions[0])
[4.4307165e-04 3.8354204e-04 2.0886613e-03 8.8066678e-04 3.6079765e-03
 1.7791630e-01 1.4651606e-03 2.2466542e-01 4.8245404e-02 5.4030383e-01]
predictions[0].argmax(0)
9

The truth is:

images, labels = next(iter(test_loader))
images = images.squeeze().numpy()
labels = labels.numpy()

print(f"{labels[0]}: '{mmids.FashionMNIST_get_class_name(labels[0])}'")
9: 'Ankle boot'

Above, next(iter(test_loader)) loads the first batch of test images. (See here for background on iterators in Python.)

\(\unlhd\)

Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)

1 In stochastic gradient descent (SGD), how is the gradient estimated at each iteration?

a) By computing the gradient over the entire dataset.

b) By using the gradient from the previous iteration.

c) By randomly selecting a subset of samples and computing the average of their gradients.

d) By averaging the gradients of all samples in the dataset.

2 What is the key advantage of using mini-batch SGD over standard SGD?

a) It guarantees faster convergence to the optimal solution.

b) It reduces the variance of the gradient estimate at each iteration.

c) It eliminates the need for computing gradients altogether.

d) It increases the computational cost per iteration.

3 Which of the following statements is true about the expected update step in stochastic gradient descent?

a) It is always equal to the full gradient descent update.

b) It is always in the opposite direction of the full gradient descent update.

c) It is, on average, equivalent to the full gradient descent update.

d) It has no relationship to the full gradient descent update.

4 In multinomial logistic regression, what is the role of the softmax function (\(\gamma\))?

a) To compute the gradient of the loss function.

b) To normalize the input features.

c) To transform scores into a probability distribution over labels.

d) To update the model parameters during gradient descent.

5 What is the Kullback-Leibler (KL) divergence used for in multinomial logistic regression?

a) To measure the distance between the predicted probabilities and the true labels.

b) To normalize the input features.

c) To update the model parameters during gradient descent.

d) To compute the gradient of the loss function.

Answer for 1: c. Justification: The text states that in SGD, “we pick a sample uniformly at random in \(\{1, ..., n\}\) and update as follows \(w^{t+1} = w^t - \alpha_t \nabla f_{x_{I_t}, y_{I_t}}(w^t)\).”

Answer for 2: b. Justification: The text implies that mini-batch SGD reduces the variance of the gradient estimate compared to standard SGD, which only uses a single sample.

Answer for 3: c. Justification: The text proves a lemma stating that “in expectation, they [stochastic updates] perform a step of gradient descent.”

Answer for 4: c. Justification: The text defines the softmax function and states that it is used to “transform these scores into a probability distribution over the labels.”

Answer for 5: a. Justification: The text introduces the KL divergence as a “notion of distance between probability measures” and uses it to define the loss function in multinomial logistic regression.