Image credit: Made with Midjourney
8. Neural networks, backpropagation and stochastic gradient descent
We end the book with an introduction to the basic mathematical building blocks of modern AI. We first derive a generalization of the Chain Rule and give a brief overview of automatic differentiation. We then describe backpropagation in the context of progressive functions, implement stochastic gradient descent (SGD), and apply these methods to deep neural networks (specifically, multilayer perceptrons). Here is a more detailed overview of the main sections of the chapter.
“Background: Jacobian, chain rule, and a brief introduction to automatic differentiation” This section introduces the Jacobian matrix, which generalizes the concept of the derivative to vector-valued functions of several variables, as well as the generalized Chain Rule for composing differentiable functions in this setting. It also covers some useful matrix algebra, specifically the Hadamard and Kronecker products. Finally, the section gives a brief introduction to automatic differentiation, a powerful technique for efficiently computing derivatives that is central to modern machine learning, and illustrates its use with the PyTorch library.
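To give a flavor of what automatic differentiation looks like in practice, here is a minimal PyTorch sketch; the vector-valued function `f` and the input point are purely illustrative, not taken from the chapter.

```python
import torch

# Illustrative vector-valued function f : R^2 -> R^2.
def f(x):
    return torch.stack([x[0] ** 2 + x[1], torch.sin(x[0]) * x[1]])

x = torch.tensor([1.0, 2.0])

# PyTorch computes the Jacobian matrix of f at x by automatic differentiation.
J = torch.autograd.functional.jacobian(f, x)
print(J)  # 2x2 Jacobian: row i is the gradient of the i-th output component
```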
“Building blocks of AI 1: backpropagation” This section develops the mathematical foundations for automatic differentiation in the context of multi-layer progressive functions, which are sequences of compositions with layer-specific parameters. It explains how to systematically apply the Chain Rule to compute gradients of these functions. The section contrasts two methods for computing gradients: the forward mode and the reverse mode (also known as backpropagation). While the forward mode computes the function and its gradient simultaneously in a recursive manner, the reverse mode is often more efficient, especially for functions with many parameters but a small number of outputs. The reverse mode achieves this efficiency by, in essence, computing matrix-vector products instead of matrix-matrix products, making it particularly well-suited for machine learning applications.
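The cost difference between the two modes can be seen directly from how the Chain Rule products are parenthesized. The following NumPy sketch uses randomly generated matrices as stand-ins for the layer Jacobians of a three-layer progressive function with scalar output; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in Jacobians for a three-layer composition f = f3 ∘ f2 ∘ f1
# with a scalar output (so J3 has a single row).
J1 = rng.standard_normal((100, 50))   # layer 1: R^50  -> R^100
J2 = rng.standard_normal((100, 100))  # layer 2: R^100 -> R^100
J3 = rng.standard_normal((1, 100))    # layer 3: R^100 -> R

# Forward mode: accumulate Jacobians from the input side
# (matrix-matrix products).
forward = J3 @ (J2 @ J1)

# Reverse mode (backpropagation): accumulate from the output side
# (row-vector-times-matrix products, since the output is a scalar).
reverse = (J3 @ J2) @ J1

print(np.allclose(forward, reverse))  # same gradient, very different cost
```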
“Building blocks of AI 2: stochastic gradient descent” This section discusses stochastic gradient descent (SGD), a popular optimization algorithm used to train machine learning models, particularly in scenarios with large datasets. It is a variation of gradient descent in which, instead of computing the gradient over the entire dataset, the gradient is estimated from a randomly selected subset of data points (either a single sample or a mini-batch). The key idea behind SGD is that, while each update may not be perfectly aligned with the true gradient, the update is, in expectation, in the direction of steepest descent, leading to convergence over time. This approach offers computational advantages, especially when dealing with massive datasets, as it avoids the expensive calculation of the full gradient at each iteration. The section provides detailed examples of applying SGD together with backpropagation. Additionally, it covers the use of PyTorch, demonstrating how to handle datasets, construct models, and perform optimization tasks using mini-batch SGD.
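As a preview, here is a minimal sketch of mini-batch SGD in PyTorch on a synthetic linear-regression problem; the data, batch size, and learning rate are illustrative and not the chapter's examples.

```python
import torch

# Synthetic linear-regression data (illustrative).
torch.manual_seed(0)
X = torch.randn(1000, 5)
w_true = torch.randn(5, 1)
y = X @ w_true + 0.1 * torch.randn(1000, 1)

w = torch.zeros(5, 1, requires_grad=True)  # parameter to learn
lr, batch_size = 0.1, 32

for step in range(500):
    # Sample a random mini-batch and compute the loss on it only.
    idx = torch.randint(0, X.shape[0], (batch_size,))
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()

    loss.backward()               # gradient via backpropagation
    with torch.no_grad():
        w -= lr * w.grad          # SGD update
        w.grad.zero_()

print(torch.norm(w - w_true))     # should be small after training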
“Building blocks of AI 3: neural networks” This section introduces neural networks, specifically focusing on the multilayer perceptron (MLP) architecture. It explains how each layer of an MLP consists of an affine map followed by a nonlinear activation function. In the setting of a classification task, the output layer uses the softmax function to produce a probability distribution over the possible classes; the loss function used to train the MLP is the cross-entropy loss. The section then walks through a detailed example of computing the gradient of the loss function with respect to the weights in a small MLP, using the Chain Rule and properties of Kronecker products. It generalizes this gradient computation to MLPs with an arbitrary number of layers. Finally, the section demonstrates how to implement the training of a neural network in PyTorch, using the Fashion-MNIST dataset.
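To fix ideas, here is a minimal PyTorch sketch of an MLP of the kind described, with illustrative layer sizes and a fake batch standing in for Fashion-MNIST data; the chapter's actual model and training loop differ in the details.

```python
import torch
import torch.nn as nn

# A small MLP for 10-class classification of 28x28 images
# (layer sizes are illustrative).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),  # affine map
    nn.ReLU(),                # nonlinear activation
    nn.Linear(128, 10),       # output logits, one per class
)

# CrossEntropyLoss applies the softmax (in log form) internally,
# so the model outputs raw logits.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One mini-batch SGD step on a fake batch of images and labels.
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()    # backpropagation
optimizer.step()   # SGD update
```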