Figure: Clusters (image credit: Made with Midjourney)

1. Introduction: a first data science problem

This chapter provides a first introduction to the mathematics of data through a concrete example: clustering. This fundamental technique in data analysis and machine learning involves grouping similar data points together based on certain characteristics or features. The goal is to divide a dataset into subsets – or clusters – such that data points within the same subset are more similar to each other than to those in other subsets. We start by reviewing basic mathematical concepts that will be used throughout the book, especially matrix and vector algebra, differential calculus and optimization, and elementary probability. We also highlight some relevant phenomena arising in the context of high-dimensional data. Here is a more detailed overview of the main sections of the chapter.

Background: a quick refresher on matrix algebra, differential calculus, and elementary probability This section reviews key topics such as vectors, matrices, and their operations, as well as important matrix properties like positive semidefiniteness. The section also discusses continuity, differentiability, and optimization, introducing important results like the Extreme Value Theorem and the First-Order Necessary Optimality Condition for local minimizers. The probability section covers expectation, variance, Chebyshev’s inequality, independence, limit theorems like the Weak Law of Large Numbers, random vectors and matrices, covariance matrices, and the normal distribution. Throughout the section, numerical examples using Python and some of its main libraries (especially NumPy and Matplotlib) are provided to illustrate the concepts and their implementation.
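
As a taste of the kind of numerical illustration used throughout, here is a minimal sketch (not taken from the chapter itself) showing a few of the reviewed concepts in NumPy: matrix-vector products, checking positive semidefiniteness via eigenvalues, and a quick simulation of the Weak Law of Large Numbers. The particular matrix, vector, and distribution parameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Matrix and vector operations
A = np.array([[2.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, -1.0])
print("A @ x =", A @ x)

# A is symmetric; it is positive semidefinite iff all its eigenvalues are >= 0
eigvals = np.linalg.eigvalsh(A)
print("eigenvalues:", eigvals, "PSD:", bool(np.all(eigvals >= 0)))

# Weak Law of Large Numbers: the sample mean of i.i.d. draws concentrates
# around the true mean (here 3.0) as the sample size grows
for n in [10, 1000, 100000]:
    samples = rng.normal(loc=3.0, scale=2.0, size=n)
    print(f"n={n:>6d}  sample mean = {samples.mean():.4f}")
```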

Clustering: an objective, an algorithm and a guarantee This section introduces \(k\)-means clustering, a fundamental problem in data science where the goal is to partition \(n\) data points in \(d\)-dimensional space into \(k\) clusters. It starts by introducing the \(k\)-means objective function, a mathematical formulation to quantify the quality of a clustering solution. An iterative algorithm is derived that alternates between finding optimal cluster representatives (i.e., centroids) for a fixed partition and finding the optimal partition for fixed representatives. One theoretical guarantee is proved: at each step, the algorithm does not deteriorate the objective – but it may not converge to a global minimum. The section also presents a matrix formulation of \(k\)-means clustering, showing that minimizing the objective is equivalent to finding a matrix factorization that closely approximates the data in Frobenius norm. The mathematical concepts are illustrated through numerical examples and an application to clustering penguin species.
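
To make the alternating scheme concrete, the following is a minimal Lloyd-style sketch of \(k\)-means, written as a standalone illustration rather than the chapter's own implementation. It assumes the data sit in an \((n, d)\) NumPy array `X`; the function name `kmeans` and its parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize the representatives (centroids) as k distinct data points
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: fix the representatives, find the optimal partition
        # (assign each point to its nearest centroid)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: fix the partition, find the optimal representatives
        # (the mean of each cluster)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # k-means objective: sum of squared distances to the assigned centroids
    obj = ((X - centers[labels]) ** 2).sum()
    return labels, centers, obj

# Example usage on two well-separated Gaussian blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, centers, obj = kmeans(X, k=2)
print("objective:", obj)
```

Each iteration performs exactly the two alternating steps described above, so the objective can only decrease or stay the same; nothing in the scheme guarantees a global minimum.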

Some observations about high-dimensional data This section explores the challenges and surprising phenomena that arise when dealing with data in high-dimensional spaces. It begins by demonstrating in numerical simulations how \(k\)-means clustering may struggle to distinguish between seemingly well-separated clusters as the dimensionality increases. The section then discusses the counter-intuitive properties of high-dimensional spaces, particularly the fact that most of the volume of a high-dimensional cube is concentrated in its corners, making it appear like a “spiky ball.” These mathematical insights highlight the “curse of dimensionality” and the need for dimension-reduction techniques when working with high-dimensional data, a topic to which the book returns later on.
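
The "spiky ball" picture can be probed with a short Monte Carlo experiment. The sketch below is an illustrative assumption rather than the chapter's own simulation: it estimates the fraction of the cube \([-1, 1]^d\) that lies inside the inscribed unit ball, which collapses toward zero as \(d\) grows, leaving most of the volume in the corners.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
for d in [2, 5, 10, 20, 50]:
    # Sample points uniformly from the cube [-1, 1]^d
    points = rng.uniform(-1.0, 1.0, size=(n, d))
    # Fraction of sampled points falling in the inscribed unit ball
    inside = (np.linalg.norm(points, axis=1) <= 1.0).mean()
    print(f"d={d:>3d}  estimated fraction inside inscribed ball = {inside:.5f}")
```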