
6.1. Motivating example: deciphering handwriting

Figure: Helpful map of ML by scikit-learn (Source)


\(\bowtie\)

We now turn to classification.

Quoting Wikipedia:

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the “spam” or “non-spam” class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.

We will illustrate this problem on the MNIST dataset. Quoting Wikipedia again:

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by “re-mixing” the samples from NIST’s original datasets. The creators felt that since NIST’s training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST’s training dataset, while the other half of the training set and the other half of the test set were taken from NIST’s testing dataset.

Figure: MNIST sample images (Source)


\(\bowtie\)

We first load the data and convert it to an appropriate matrix representation. The data can be accessed with torchvision.datasets.MNIST.

import torch
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
# Download and load the MNIST dataset
mnist = datasets.MNIST(root='./data', 
                       train=True, 
                       download=True, 
                       transform=transforms.ToTensor())

# Convert the dataset to a PyTorch DataLoader
train_loader = torch.utils.data.DataLoader(mnist, 
                                           batch_size=len(mnist), 
                                           shuffle=False)

# Extract images and labels from the DataLoader
imgs, labels = next(iter(train_loader))
imgs = imgs.squeeze().numpy()
labels = labels.numpy()

The squeeze() above removes the singleton color channel dimension, since the images are grayscale. The numpy() converts the PyTorch tensors into NumPy arrays. See torch.utils.data.DataLoader for details on data loading. We will say more about PyTorch later in this chapter.
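As a quick sanity check of the conversion (the 60,000 matches the size of the MNIST training set quoted above):

imgs.shape
(60000, 28, 28)
labels.shape
(60000,)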

For example, the first image and its label are:

plt.figure()
plt.imshow(imgs[0])
plt.show()
(Output: a plot of the first image, a handwritten 5.)
labels[0]
5

For now, we look at a subset of the samples: the 0’s and 1’s.

# Keep only the samples with label 0 or 1
mask = (labels == 0) | (labels == 1)
imgs01 = imgs[mask]
labels01 = labels[mask]

In this new dataset, the first sample is:

plt.figure()
plt.imshow(imgs01[0])
plt.show()
(Output: a plot of the first image in the subset, a handwritten 0.)
labels01[0]
0

Next, we transform the images into vectors using the reshape() function, which changes the dimensions of an array without changing its data. The first dimension, which runs over the samples, keeps its length len(imgs01). The -1 tells NumPy to infer whatever length is needed to fit the remaining entries, here \(28 \times 28 = 784\). In other words, we are “flattening” each \(28 \times 28\) image into a single vector of length \(784\).

imgs01.shape
(12665, 28, 28)
X = imgs01.reshape(len(imgs01), -1)
X.shape
(12665, 784)
y = labels01

The input data is now of the form \(\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}\) where \(\mathbf{x}_i \in \mathbb{R}^d\) are the features and \(y_i \in \{0,1\}\) is the label. Above we use the matrix representation \(X \in \mathbb{R}^{n \times d}\) whose rows are the \(\mathbf{x}_i^T\), \(i = 1,\ldots, n\), together with \(\mathbf{y} = (y_1, \ldots, y_n)^T \in \{0,1\}^n\).

Our goal:

learn a classifier from the examples \(\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}\), that is, a function \(\hat{f} : \mathbb{R}^d \to \mathbb{R}\) such that \(\hat{f}(\mathbf{x}_i) \approx y_i\).

We may want to enforce that the output is in \(\{0,1\}\) as well. This problem is referred to as binary classification.

A natural approach to this type of supervised learning problem is to define two objects:

  1. Family of classifiers: A class \(\widehat{\mathcal{F}}\) of classifiers from which to pick \(\hat{f}\).

  2. Loss function: A loss function \(\ell(\hat{f}, (\mathbf{x},y))\) which quantifies how good of a fit \(\hat{f}(\mathbf{x})\) is to \(y\).

Our goal is then to solve

\[ \min_{\hat{f} \in \widehat{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}, (\mathbf{x}_i, y_i)), \]

that is, we seek to find a classifier among \(\widehat{\mathcal{F}}\) that minimizes the average loss over the examples.
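In code, this objective is just a finite average. Here is a minimal sketch of computing that average for a candidate classifier; the names empirical_risk, loss, and f_hat are hypothetical placeholders (not library functions), and X is assumed to hold one sample per row as above.

import numpy as np

def empirical_risk(loss, f_hat, X, y):
    # average the loss of the candidate classifier f_hat over the n examples
    return np.mean([loss(f_hat, x_i, y_i) for x_i, y_i in zip(X, y)])

Minimizing over \(\widehat{\mathcal{F}}\) then amounts to searching for a classifier that makes this average as small as possible.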

For instance, in logistic regression, we consider linear classifiers of the form

\[ \hat{f}(\mathbf{x}) = \sigma(\mathbf{x}^T \boldsymbol{\theta}) \qquad \text{with} \qquad \sigma(t) = \frac{1}{1 + e^{-t}} \]

where \(\boldsymbol{\theta} \in \mathbb{R}^d\) is a parameter vector, and we use the cross-entropy loss

\[ \ell(\hat{f}, (\mathbf{x}, y)) = - y \log(\sigma(\mathbf{x}^T \boldsymbol{\theta})) - (1-y) \log(1- \sigma(\mathbf{x}^T \boldsymbol{\theta})). \]

In parametric form, the problem boils down to

\[ \min_{\boldsymbol{\theta} \in \mathbb{R}^d} - \frac{1}{n} \sum_{i=1}^n y_i \log(\sigma(\mathbf{x}_i^T \boldsymbol{\theta})) - \frac{1}{n} \sum_{i=1}^n (1-y_i) \log(1- \sigma(\mathbf{x}_i^T \boldsymbol{\theta})). \]
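One standard way to attack this minimization is gradient descent. Here is a minimal NumPy sketch, assuming X and y are the arrays constructed above (one sample per row); the step size and iteration count are illustrative choices, and numerical safeguards (such as keeping \(\sigma\) away from \(0\) and \(1\)) are omitted. A direct calculation shows that the gradient of the objective is \(\frac{1}{n} X^T (\sigma(X \boldsymbol{\theta}) - \mathbf{y})\), where \(\sigma\) is applied entrywise.

import numpy as np

def sigmoid(t):
    # the sigmoid function defined above, applied entrywise
    return 1 / (1 + np.exp(-t))

def grad(theta, X, y):
    # gradient of the average cross-entropy loss:
    # (1/n) X^T (sigmoid(X theta) - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# plain gradient descent with a fixed step size
theta = np.zeros(X.shape[1])
step = 0.1
for _ in range(100):
    theta -= step * grad(theta, X, y)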

To obtain a prediction in \(\{0,1\}\) here, we can threshold \(\hat{f}(\mathbf{x})\) at some \(\tau \in [0,1]\), that is, return \(\mathbf{1}\{\hat{f}(\mathbf{x}) > \tau\}\).
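Continuing the sketch above, with the illustrative choice \(\tau = 1/2\):

tau = 0.5
y_pred = (sigmoid(X @ theta) > tau).astype(int)
np.mean(y_pred != y)  # fraction of misclassified training samples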

We will explain in a later chapter where this choice comes from.

The purpose of this chapter is to develop some of the mathematical theory and algorithms needed to solve this type of optimization formulation.