1.3. Clustering: an objective, an algorithm and a guarantee#

The input and overall goal: Consider the following fundamental problem in data science. We are given \(n\) vectors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\) in \(\mathbb{R}^d\). Our goal is to find a good clustering: loosely speaking, we want to partition these data points into \(k\) disjoint subsets - or clusters - with small pairwise distances within clusters and large pairwise distances across clusters.

To make this rather vague problem more precise, we consider a specific objective function known as the \(k\)-means objective. Our approach here will be typical of how one might approach a mathematical data science problem. We will first formulate the problem as an optimization problem, then derive an algorithm to solve it, and finally provide some rigorous guarantees about the output.

Figure: Data points forming three clusters (Source)

Data points forming three clusters

\(\bowtie\)

The output: But first, we need to define precisely what we are trying to extract from the data. What is the mathematical structure of the solution sought? Fix a number of clusters \(k\). Formally, we define a clustering as a partition.

DEFINITION (Partition) A partition of \([n] = \{1,\ldots,n\}\) of size \(k\) is a collection of non-empty subsets \(C_1,\ldots,C_k \subseteq [n]\) that:

  • are pairwise disjoint, i.e., \(C_i \cap C_j = \emptyset\), \(\forall i \neq j\)

  • cover all of \([n]\), i.e., \(\cup_{i=1}^k C_i = [n]\).

\(\natural\)

EXAMPLE: Suppose we are given \(8\) data points in \(\mathbb{R}^2\) as follows:

\[\begin{align*} &\mathbf{x}_1 = \begin{pmatrix}1\\0\end{pmatrix}, \mathbf{x}_2 = \begin{pmatrix}-2\\0\end{pmatrix}, \mathbf{x}_3 = \begin{pmatrix}-2\\1\end{pmatrix}, \mathbf{x}_4 = \begin{pmatrix}1\\-3\end{pmatrix},\\ &\mathbf{x}_5 = \begin{pmatrix}-10\\10\end{pmatrix}, \mathbf{x}_6 = \begin{pmatrix}2\\-2\end{pmatrix}, \mathbf{x}_7 = \begin{pmatrix}-3\\1\end{pmatrix}, \mathbf{x}_8 = \begin{pmatrix}3\\-1\end{pmatrix}. \end{align*}\]

So here \(n=8\) and \(d = 2\). Assume we look for \(k = 3\) clusters. Then a valid clustering would be for instance:

\[ C_1 = \{1, 4, 6, 8\}, C_2 = \{2, 3, 7\}, C_3 = \{5\}, \]

which corresponds to assigning data points \(\mathbf{x}_1, \mathbf{x}_4, \mathbf{x}_6, \mathbf{x}_8\) to the first cluster, data points \(\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_7\) to the second cluster and data point \(\mathbf{x}_5\) to the third cluster. Note in particular that the sets \(C_1, C_2, C_3\) satisfy the conditions of a partition, i.e., they are disjoint and cover all of \([8] = \{1,2,\ldots,8\}\). Or put differently, each data point is assigned to one and exactly one cluster. \(\lhd\)

We number the clusters \(C_1,\ldots,C_k\) for notational convenience, but their order is meaningless. Two partitions are the same if they are the same family of subsets. E.g., in the previous example, \(C_1 = \{1, 4, 6, 8\}, C_2 = \{2, 3, 7\}, C_3 = \{5\}\) and \(C_1 = \{5\}, C_2 = \{1, 4, 6, 8\}, C_3 = \{2, 3, 7\}\) are equivalent clusterings.
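For instance, one can check in Python that the clustering in the example above satisfies both conditions of a partition. This quick sketch is purely illustrative:

C1, C2, C3 = {1, 4, 6, 8}, {2, 3, 7}, {5}
clusters = [C1, C2, C3]

# pairwise disjoint?
print(all(clusters[i].isdisjoint(clusters[j])
          for i in range(3) for j in range(i + 1, 3)))   # True

# cover all of [8]?
print(set().union(*clusters) == set(range(1, 9)))        # True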

1.3.1. The k-means objective#

Under the \(k\)-means objective, the “cost” of \(C_1,\ldots,C_k\) is defined as

\[ \mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2. \]

Here \(\boldsymbol{\mu}_i \in \mathbb{R}^d\) is the representative – or center – of cluster \(C_i\). Note that \(\boldsymbol{\mu}_i\) need not be one of the \(\mathbf{x}_j\)’s.

Our goal is to find a partition \(C_1,\ldots,C_k\) that minimizes \(\mathcal{G}(C_1,\ldots,C_k)\), i.e., solves the problem

\[ \min_{C_1,\ldots,C_k} \mathcal{G}(C_1,\ldots,C_k) \]

over all partitions of \([n]\) of size \(k\). This is a finite optimization problem, as there are only a finite number of such partitions. Note, however, that the objective function itself is an optimization problem over \(\mathbb{R}^d \times \cdots \times \mathbb{R}^{d}\), that is, \(k\) copies of \(\mathbb{R}^d\).
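Since the number of partitions is finite, the problem could in principle be solved by exhaustive enumeration, although this is hopeless beyond toy examples. The following sketch (our own illustration, not an algorithm we will use) enumerates all \(k^n\) cluster assignments for tiny \(n\), using the fact, established below, that the best center for a fixed cluster is its centroid:

import itertools
import numpy as np

def brute_force_kmeans(X, k):
    # Enumerate all k^n cluster assignments (each partition is visited
    # several times under different labelings); only feasible for tiny n.
    n, d = X.shape
    best_G, best_assign = np.inf, None
    for assign in itertools.product(range(k), repeat=n):
        if len(set(assign)) < k:   # skip assignments with an empty cluster
            continue
        labels = np.array(assign)
        G = 0.0
        for i in range(k):
            pts = X[labels == i, :]
            # the best center of a fixed cluster is its centroid (shown below)
            G += np.sum((pts - pts.mean(axis=0)) ** 2)
        if G < best_G:
            best_G, best_assign = G, labels
    return best_G, best_assign

On the \(8\)-point example above with \(k=3\), this loop visits only \(3^8 = 6561\) assignments, but the count grows exponentially with \(n\).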

To quote Wikipedia:

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

Figure: Illustration of k-means clustering (Source)

Illustration of k-means clustering

\(\bowtie\)

YES OR NO: Is it possible for a global solution of the \(k\)-means clustering problem to contain an empty cluster? \(\ddagger\)

In general, the problem is NP-hard, that is, roughly speaking no “fast” algorithm is expected to exist to solve it. The \(k\)-means algorithm is a popular heuristic. It is based on the idea that the following two sub-problems are easy to solve:

  1. finding the optimal representatives for a fixed partition;

  2. finding the optimal partition for a fixed set of representatives.

One then alternates between the two (perhaps until progress falls below a tolerance). This is reasonable since our goal, as we pointed out above, is to solve the minimization problem

\[ \min_{C_1,\ldots,C_k} \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 \]

where \(C_1,\ldots,C_k\) ranges over all partitions of \([n]\) of size \(k\). Fixing the partition \(C_1,\ldots,C_k\) and minimizing over \(\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d\) corresponds to solving the first problem above, while fixing \(\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k \in \mathbb{R}^d\) and minimizing over partitions \(C_1,\ldots,C_k\) corresponds to solving the second problem.

LEARNING BY CHATTING: Ask your favorite AI chatbot about the differences between \(k\)-means, \(k\)-medians, and \(k\)-medoids clustering.

Some useful optimization results: To analyze the \(k\)-means algorithm, we will rely on a few basic observations.

Minimizing a quadratic function: To elaborate on the first step above, we review an elementary fact about quadratic functions. Consider the function

\[ q(x) = a x^2 + b x + c. \]

When \(a > 0\), \(q\) has a unique minimum.

LEMMA (Minimum of a Quadratic Function) Let \(q(x) = a x^2 + b x + c\) where \(a > 0\) and \(x \in \mathbb{R}\). The unique global minimum of \(q\) is attained at

\[ x^* = -\frac{b}{2a}. \]

\(\flat\)

Proof: By the First-Order Necessary Condition, a global minimizer of \(q\) (which is necessarily a local minimizer) satisfies the condition

\[ \frac{\mathrm{d}}{\mathrm{d}x} q(x) = 2 a x + b = 0, \]

whose unique solution is

\[ x^*= -\frac{b}{2a}. \]

To see that \(x^*\) is indeed a global minimizer, we re-write \(q\) as

\[\begin{align*} q(x) &= a \left(x^2 + 2 \left[\frac{b}{2a}\right] x\right) + c\\ &= a \left(x^2 + 2 \left[\frac{b}{2a}\right] x + \left[\frac{b}{2a}\right]^2\right) - a \left[\frac{b}{2a}\right]^2 + c\\ &= a (x - x^*)^2 + \left[c - \frac{b^2}{4a}\right]. \end{align*}\]

Clearly, any other \(x\) gives a higher value for \(q\). The step on the second line above is called Completing the Square. \(\square\)

NUMERICAL CORNER: Here’s a numerical example. We first define a quadratic function.

def q(a, b, c, x):
    return a * (x ** 2) + b * x + c

We plot it for different values of the coefficients.

x = np.linspace(-2, 2, 100)

plt.plot(x, q(2, 4, -1, x))
plt.plot(x, q(2, -4, 4, x))
plt.plot(x, q(-2, 0, 4, x))

plt.legend(['y1', 'y2', 'y3'])

plt.show()

\(\unlhd\)

Optimizing an additively separable function: Functions that can be written as a sum of terms, each depending on a disjoint subset of the coordinates, arise commonly in optimization and have convenient “separability” properties.

For vectors \(\mathbf{x}_i \in \mathbb{R}^{d_i}\), \(i \in [\ell]\), with \(d = \sum_{i=1}^\ell d_i\), their concatenation is denoted \((\mathbf{x}_1, \ldots, \mathbf{x}_\ell) \in \mathbb{R}^d\). That is, it is the vector obtained by stacking the coordinates of \(\mathbf{x}_1, \ldots, \mathbf{x}_\ell\) into a single vector

\[\begin{split} (\mathbf{x}_1, \ldots, \mathbf{x}_\ell) = \begin{pmatrix} \mathbf{x}_1\\ \vdots\\ \mathbf{x}_\ell \end{pmatrix}. \end{split}\]

A different way to see this is that \((\mathbf{x}_1, \ldots, \mathbf{x}_\ell)\) is a block vector with blocks \(\mathbf{x}_1, \ldots, \mathbf{x}_\ell\). For example, if \(\mathbf{x}_1 = (1, 2)\) and \(\mathbf{x}_2 = (-1, -3, -5)\), then \((\mathbf{x}_1,\mathbf{x}_2) = (1, 2, -1, -3, -5)\).
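In NumPy, such a concatenation can be computed with numpy.concatenate; a quick illustration of the example above:

import numpy as np

x1 = np.array([1., 2.])
x2 = np.array([-1., -3., -5.])
print(np.concatenate((x1, x2)))   # [ 1.  2. -1. -3. -5.]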

LEMMA (Optimizing a Separable Function) Assume that \(\mathbf{z} \in \mathbb{R}^d\) can be broken up into subvectors \(\mathbf{x}_i \in \mathbb{R}^{d_i}\), \(i \in [\ell]\), with \(\sum_{i=1}^\ell d_i = d\), as follows:

\[ \mathbf{z} = (\mathbf{x}_1, \ldots, \mathbf{x}_\ell). \]

Suppose that the real-valued function \(h : \mathbb{R}^d \to \mathbb{R}\) can be written in the additively separable form

\[ h(\mathbf{z}) = f_1(\mathbf{x}_1) + \cdots + f_\ell(\mathbf{x}_\ell), \]

for a collection of functions \(f_i : \mathbb{R}^{d_i} \to \mathbb{R}\), \(i \in [\ell]\). If, for each \(i \in [\ell]\), \(\mathbf{x}_i^*\) is a global minimum of \(f_i\), then \(\mathbf{z}^* = (\mathbf{x}_1^*, \ldots, \mathbf{x}_\ell^*)\) is a global minimum of \(h\). \(\flat\)

Proof idea: Each term in the sum defining \(h\) depends on a separate set of coordinates and therefore is unaffected by the choices made in other terms.

Proof: Let \(\mathbf{z} = (\mathbf{x}_1, \ldots, \mathbf{x}_\ell)\). Since \(\mathbf{x}_i^*\) is a global minimum of \(f_i\), it holds that

\[ f_i(\mathbf{x}_i^*) \leq f_i(\mathbf{x}_i), \]

for all \(i\).

Hence,

\[ h(\mathbf{z}^*) = f_1(\mathbf{x}_1^*) + \cdots + f_\ell(\mathbf{x}_\ell^*) \leq f_1(\mathbf{x}_1) + \cdots + f_\ell(\mathbf{x}_\ell) = h(\mathbf{z}). \]

Since \(\mathbf{z}\) is arbitrary, we have proved the claim. \(\square\)
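Here is a small numerical illustration (our own sketch) of the lemma with two quadratic pieces, each minimized using the Minimum of a Quadratic Function formula:

import numpy as np

f1 = lambda x: 2 * x**2 + 4 * x - 1    # minimized at x = -4/(2*2) = -1
f2 = lambda x: 3 * x**2 - 6 * x + 4    # minimized at x =  6/(2*3) =  1
h = lambda z: f1(z[0]) + f2(z[1])      # additively separable

z_star = (-1.0, 1.0)                   # block-wise minimizers

# sanity check: no point on a coarse grid does better than z_star
grid = np.linspace(-5, 5, 101)
vals = [h((a, b)) for a in grid for b in grid]
print(h(z_star) <= min(vals) + 1e-12)  # True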

Composing with a non-decreasing function: Recall that a real-valued function \(f\) of a single variable is non-decreasing if

\[ x \leq y \implies f(x) \leq f(y). \]

LEMMA (Composing with a Non-Decreasing Function) Let \(f : \mathbb{R} \to \mathbb{R}\) be non-decreasing, let \(g : \mathbb{R}^d \to \mathbb{R}\), and define \(h(\mathbf{x}) = f(g(\mathbf{x}))\). If \(\mathbf{x}^*\) is a global minimum of \(g\), then it is also a global minimum of \(h\). \(\flat\)

Proof idea: This just follows from applying the definitions.

Proof: Let \(\mathbf{x} \in \mathbb{R}^d\). Because \(\mathbf{x}^*\) is a global minimum of \(g\),

\[ g(\mathbf{x}^*) \leq g(\mathbf{x}). \]

Further, since \(f\) is non-decreasing,

\[ h(\mathbf{x}^*) = f(g(\mathbf{x}^*)) \leq f(g(\mathbf{x})) = h(\mathbf{x}). \]

Since \(\mathbf{x}\) is arbitrary, we have proved the claim. \(\square\)

Sub-problem 1: finding the optimal representatives. We denote by \(|C_i|\) the number of elements in \(C_i\).

EXAMPLE: (continued) Continuing the example above, the sizes of the clusters are respectively \(|C_1| = 4, |C_2| = 3, |C_3| = 1\). Note in particular that \(|C_1| + |C_2| + |C_3| = 8 = n\), as follows from the fact that \(C_1, C_2, C_3\) is a partition. \(\lhd\)

LEMMA (Optimal Representatives) Fix a partition \(C_1,\ldots,C_k\). The optimal representatives under the objective

\[ G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) = \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2, \]

are the centroids

\[ \boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j\in C_i} \mathbf{x}_j. \]

\(\flat\)

Proof idea: The objective \(G\) can be written as a sum, where each term is a quadratic function in one component of one of the \(\boldsymbol{\mu}_i\)’s. Each of these terms is minimized by the average of the corresponding components of the \(\mathbf{x}_j\)’s belonging to \(C_i\).

EXAMPLE: (continued) Continuing with the previous example, we compute the optimal representatives for the fixed partition \(C_1, C_2, C_3\) above. We get

\[\begin{align*} \boldsymbol{\mu}_1^* &= \frac{1}{4}[\mathbf{x}_1 + \mathbf{x}_4 + \mathbf{x}_6 + \mathbf{x}_8]\\ &= \frac{1}{4}\left[\begin{pmatrix}1\\0\end{pmatrix} + \begin{pmatrix}1\\-3\end{pmatrix} + \begin{pmatrix}2\\-2\end{pmatrix} + \begin{pmatrix}3\\-1\end{pmatrix} \right] = \begin{pmatrix}7/4\\-3/2\end{pmatrix},\\ \boldsymbol{\mu}_2^* &= \frac{1}{3}[\mathbf{x}_2 + \mathbf{x}_3 + \mathbf{x}_7]\\ &= \frac{1}{3}\left[\begin{pmatrix}-2\\0\end{pmatrix}+ \begin{pmatrix}-2\\1\end{pmatrix}+ \begin{pmatrix}-3\\1\end{pmatrix} \right] = \begin{pmatrix}-7/3\\2/3\end{pmatrix},\\ \boldsymbol{\mu}_3^* &= \frac{1}{1}[\mathbf{x}_5] = \begin{pmatrix}-10\\10\end{pmatrix}. \end{align*}\]

\(\lhd\)
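These centroids are easy to verify numerically. In the sketch below, X stacks the eight data points as rows (as we will do more systematically later in this section), and the cluster indices from the text are shifted down by one to match Python’s 0-based indexing:

import numpy as np

X = np.array([[1., 0.], [-2., 0.], [-2., 1.], [1., -3.],
              [-10., 10.], [2., -2.], [-3., 1.], [3., -1.]])

for C in [[1, 4, 6, 8], [2, 3, 7], [5]]:
    idx = [j - 1 for j in C]            # indices in the text start at 1
    print(np.mean(X[idx, :], axis=0))   # centroid of the cluster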

Proof: (Optimal Representatives) Using the notation \(\mathbf{x}_j = (x_{j1},\ldots,x_{jd})^T\) and similarly for \(\boldsymbol{\mu}_i\), note that we can expand the \(k\)-means objective as

\[\begin{align*} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 &= \sum_{i=1}^k \sum_{j \in C_i} \sum_{m=1}^d (x_{jm} - \mu_{im})^2\\ &= \sum_{i=1}^k \sum_{m=1}^d \left[\sum_{j \in C_i} (x_{jm} - \mu_{im})^2\right]. \end{align*}\]

The expression in square brackets is a quadratic function in \(\mu_{im}\)

\[\begin{align*} q_{im}(\mu_{im}) &= \sum_{j \in C_i} (x_{jm} - \mu_{im})^2\\ &= \left\{\sum_{j \in C_i} x_{jm}^2\right\} + \left\{- 2 \sum_{j \in C_i} x_{jm}\right\} \mu_{im} + \left\{|C_i| \right\} \mu_{im}^2. \end{align*}\]

Therefore, by the formula for the Minimum of a Quadratic Function, \(q_{im}\) is minimized at

\[ \mu_{im}^* = - \frac{- 2 \sum_{j \in C_i} x_{jm}}{2 |C_i|} = \frac{1}{|C_i|} \sum_{j \in C_i} x_{jm}. \]

Since each term \(q_{im}(\mu_{im})\) in the sum over \(i, m\) making up the objective function \(G\) is minimized at \(\boldsymbol{\mu}_1^*,\ldots, \boldsymbol{\mu}_k^*\), so is \(G\) by Optimizing a Separable Function. \(\square\)

The fact that the squared norm decomposes into a sum over the coordinates (which the norm itself does not, because of the square root) is one reason it is convenient to use here, as the last proof made apparent.

Sub-problem 2: finding the optimal partition. Given \(n\) vectors \(\mathbf{x}_1,\ldots,\mathbf{x}_n\) in \(\mathbb{R}^d\) and a partition \(C_1, \ldots, C_k \subseteq [n]\), it will be useful to have some notation for the corresponding cluster assignment: we define \(c(j) = i\) if \(j \in C_i\).

EXAMPLE: (continued) Continuing the example above, the clusters \(C_1 = \{1, 4, 6, 8\}, C_2 = \{2, 3, 7\}, C_3 = \{5\}\) correspond to the assignment

\[ c(1) = 1, c(2) = 2, c(3) = 2, c(4) = 1, c(5) = 3, c(6) = 1, c(7) = 2, c(8) = 1. \]

\(\lhd\)

LEMMA (Optimal Clustering) Fix the representatives \(\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k\). An optimal partition under the objective

\[ G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) = \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2, \]

is obtained as follows. For each \(j\), find the \(\boldsymbol{\mu}_i\) that minimizes \(\|\mathbf{x}_j - \boldsymbol{\mu}_i\|\) (picking one arbitrarily in the case of ties) and assign \(\mathbf{x}_j\) to \(C_i\) (i.e., add \(j\) to \(C_i\)). \(\flat\)

Proof: If \(c\) is the cluster assignment associated to \(C_1,\ldots,C_k\), then we can re-write the objective as

\[ \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 = \sum_{j=1}^n \|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|^2. \]

By definition, when the \(\boldsymbol{\mu}_i\)’s are fixed, each term in the sum on the right-hand side is minimized separately by the assignment in the statement. Hence so is the sum itself by (a finite-space version of) Optimizing a Separable Function. Note that we used the fact that the square (restricted to nonnegative numbers) is non-decreasing to conclude, via Composing with a Non-Decreasing Function, that minimizing \(\|\mathbf{x}_j - \boldsymbol{\mu}_i\|\) and minimizing \(\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2\) are equivalent. \(\square\)

EXAMPLE: (continued) Continuing the example above, suppose that we choose representatives

\[\begin{split} \boldsymbol{\mu}_1 = \begin{pmatrix}-2\\1\end{pmatrix}, \boldsymbol{\mu}_2 = \begin{pmatrix}2\\-1\end{pmatrix}, \boldsymbol{\mu}_3 = \begin{pmatrix}-10\\10\end{pmatrix}. \end{split}\]

Then we find the cluster assignment of \(\mathbf{x}_1\) by computing its distance to each representative:

\[\begin{split} \|\mathbf{x}_1 - \boldsymbol{\mu}_1\| = \left\|\begin{pmatrix}1\\0\end{pmatrix} - \begin{pmatrix}-2\\1\end{pmatrix}\right\| = \sqrt{(1-(-2))^2 + (0-1)^2} = \sqrt{10}, \end{split}\]
\[\begin{split} \|\mathbf{x}_1 - \boldsymbol{\mu}_2\| = \left\|\begin{pmatrix}1\\0\end{pmatrix} - \begin{pmatrix}2\\-1\end{pmatrix}\right\| = \sqrt{(1-2)^2 + (0-(-1))^2} = \sqrt{2}, \end{split}\]
\[\begin{split} \|\mathbf{x}_1 - \boldsymbol{\mu}_3\| = \left\|\begin{pmatrix}1\\0\end{pmatrix} - \begin{pmatrix}-10\\10\end{pmatrix}\right\| = \sqrt{(1-(-10))^2 + (0-10)^2} = \sqrt{221}. \end{split}\]

The minimum is achieved for \(\boldsymbol{\mu}_2\) so we assign \(\mathbf{x}_1\) to \(C_2\), i.e., \(1 \in C_2\) and \(c(1) = 2\). \(\lhd\)
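This nearest-representative computation is straightforward in NumPy; a quick sketch for \(\mathbf{x}_1\):

import numpy as np

x1 = np.array([1., 0.])
reps = np.array([[-2., 1.], [2., -1.], [-10., 10.]])

dist = np.linalg.norm(reps - x1, axis=1)   # distances to the three representatives
print(dist ** 2)        # [ 10.   2. 221.]
print(np.argmin(dist))  # 1, i.e., the second representative (indices start at 0)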

1.3.2. The k-means algorithm and its analysis#

We are now ready to describe the \(k\)-means algorithm, also known as Lloyd’s algorithm. We start from a random assignment of clusters. (An alternative initialization strategy is to choose \(k\) representatives at random among the data points.) We then alternate between the optimal choices in the lemmas. In lieu of pseudo-code, we write out the algorithm in Python.

The input X is assumed to be a collection of \(n\) vectors \(\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d\) stacked into a matrix, with one row for each data point. The other input, k, is the desired number of clusters. There is an optional input maxiter for the maximum number of iterations, which is set to \(10\) by default.

We first define separate functions for the two main steps. To find the index of the minimum entry of an array, we use the function numpy.argmin. We also use numpy.linalg.norm to compute the Euclidean distance.

def opt_reps(X, k, assign):
    (n, d) = X.shape
    reps = np.zeros((k, d))
    for i in range(k):
        # indices of the data points currently assigned to cluster i
        in_i = [j for j in range(n) if assign[j] == i]
        # centroid of cluster i (the optimal representative)
        reps[i,:] = np.sum(X[in_i,:],axis=0) / len(in_i)
    return reps

def opt_clust(X, k, reps):
    (n, d) = X.shape
    dist = np.zeros(n)
    assign = np.zeros(n, dtype=int)
    for j in range(n):
        # distances from x_j to each of the k representatives
        dist_to_i = np.array([LA.norm(X[j,:] - reps[i,:]) for i in range(k)])
        assign[j] = np.argmin(dist_to_i)   # index of the closest representative
        dist[j] = dist_to_i[assign[j]]
    G = np.sum(dist ** 2)   # current value of the k-means objective
    print(G)
    return assign

The main function follows. Below, rng.integers(0,k,n) is an array of n uniformly chosen integers between 0 and k-1 (inclusive). (See random.Generator.integers for details.)

def kmeans(X, k, maxiter=10):
    (n, d) = X.shape
    # start from a uniformly random cluster assignment
    assign = rng.integers(0,k,n)
    for iter in range(maxiter):
        # Step 1: Optimal representatives for fixed clusters
        reps = opt_reps(X, k, assign)
        # Step 2: Optimal clusters for fixed representatives
        assign = opt_clust(X, k, reps)
    return assign

LEARNING BY CHATTING: AI chatbots can serve as great personal tutors. For instance, they can provide additional information about the code in this book. Just copy-paste a piece of code and ask “What is this Python code doing?” Don’t hesitate to ask follow-up questions. Here is an example using ChatGPT 4.

Warning: As you probably know, AI chatbots can be wrong, so assess what they return with a critical mind and/or double-check with other sources (e.g., package documentation).

Prompt engineering: Finding the right prompt, or series of prompts, can make a big difference. If it helps you, here is an explanation of our \(k\)-means algorithm in the style of Shakespeare via ChatGPT 4:

Scene: The Realm of Numerica, in the Enchanted Forest of Analysis

Enter LORD VECTOR, LADY PARTITION, and an assembly of WHISPERING WINDS

LORD VECTOR (a majestic entity):

Hearken, Whispering Winds, scattered far and wide,

I am Lord Vector, your anchor and your guide.

In clusters, you must settle, by my call and by my side,

To me, you’ll draw nigh, in unity, abide.

LADY PARTITION (a sage enchantress):

And I, Lady Partition, with vision clear and keen,

Determine where amongst the clusters each of you shall lean.

With Lord Vector, a ballet we compose, with steps so fine,

To sort, to gather, to align, under a sign divine.

WHISPERING WIND 1:

But how, noble Vector, wilt thou find thy rightful stead?

In this expanse diverse, how set thy bed?

LORD VECTOR:

Through cycles and through wisdom, I seek my sacred ground,

Where the sum of all distances to me is least found.

From your essence, I emerge, an average pure and sound,

A beacon most true, by which I am crowned.

WHISPERING WIND 2:

And Lady Partition, how do we seek our kin,

In this multitude, how does our journey begin?

LADY PARTITION:

By measure and affinity, to your closest Lord Vector, you’ll gravitate,

To him, your allegiance, you’ll consecrate.

In groups, you’ll assemble, by likeness you’re drawn,

A collective symphony, from dusk till dawn.

LORD VECTOR and LADY PARTITION assume their roles, guiding the WHISPERING WINDS to move and regroup.

LORD VECTOR:

Observe now how I traverse, as your allegiances refine,

A center more apt, in this spell of mine.

LADY PARTITION:

And as you drift, dear winds, to your Lords true,

A closer union we construe, a balance anew.

The WHISPERING WINDS shuffle and align, moving from one cluster to another, as LORD VECTOR adjusts his position with grace.

WHISPERING WIND 3:

At length, I perceive, where I am to dwell,

With peers akin, in cluster well.

LORD VECTOR:

So we dance and weave, ‘til stillness we achieve,

When each wind has settled, and we, in relief.

LADY PARTITION:

Thus, optimal clusters come forth, vivid and serene,

With centers well placed, a sight unseen.

The WHISPERING WINDS now rest peacefully in clusters around their LORD VECTORS, the Enchanted Forest of Analysis a testament to their orderly ballet.

LORD VECTOR & LADY PARTITION (together):

Our task here concludes, with harmony and grace,

The K-means enchantment, a solution embraced.

Exeunt omnes, as the WHISPERING WINDS repose in unity around their LORD VECTORS, the realm of Numerica reflecting their harmonious dance.

End of Scene

NUMERICAL CORNER: We apply our implementation of \(k\)-means to the example above. We fix k to \(3\). Here the data matrix X is the following:

X = np.array([[1., 0.],[-2., 0.],[-2.,1.],[1.,-3.],
              [-10.,10.],[2.,-2.],[-3.,1.],[3.,-1.]])
assign = kmeans(X, 3)
162.7
74.8611111111111
9.083333333333334
9.083333333333334
9.083333333333334
9.083333333333334
9.083333333333334
9.083333333333334
9.083333333333334
9.083333333333334

We visualize the output by coloring the points according to their cluster assignment.

plt.scatter(X[:,0], X[:,1], c=assign)
plt.show()

We can compute the final representatives (optimal for the final assignment) by using the subroutine opt_reps.

print(opt_reps(X, 3, assign))
[[ -2.33333333   0.66666667]
 [  1.75        -1.5       ]
 [-10.          10.        ]]

Each row is the center of the corresponding cluster. Note that these match the ones we previously computed. Indeed, the clustering is the same (although the clusters are not necessarily listed in the same order).

\(\unlhd\)

YES OR NO: Suppose we have infinite computational resources and run Lloyd’s algorithm forever, do you think it will necessarily converge to a global minimum? If your answer is no, can you think of an alternative algorithm that is guaranteed to output a global minimum provided enough computational resources? \(\ddagger\)

The \(k\)-means algorithm is only a heuristic. In particular, it is not guaranteed to find the global minimum of the \(k\)-means objective. However, it is guaranteed to improve the objective at every iteration, or more precisely, not to make it worse.

THEOREM (Convergence of \(k\)-means cost) The sequence of objective function values produced by the \(k\)-means algorithm is non-increasing. \(\sharp\)

Proof idea: By the Optimal Representatives and Optimal Clustering lemmas, each step does not increase the objective.

Proof: Let \(C_1',\ldots,C_k'\) be the current clusters, with representatives \(\boldsymbol{\mu}_1',\ldots,\boldsymbol{\mu}_k'\). After Step 1, the new representatives are \(\boldsymbol{\mu}_1'',\ldots,\boldsymbol{\mu}_k''\). By the Optimal Representatives Lemma, they satisfy

\[ \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i'\|^2. \]

After Step 2, the new clusters are \(C_1'',\ldots,C_k''\). By the Optimal Clustering Lemma, they satisfy

\[ \sum_{i=1}^k \sum_{j \in C_i''} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2. \]

Combining these two inequalities gives

\[ \sum_{i=1}^k \sum_{j \in C_i''} \|\mathbf{x}_j - \boldsymbol{\mu}_i''\|^2 \leq \sum_{i=1}^k \sum_{j \in C_i'} \|\mathbf{x}_j - \boldsymbol{\mu}_i'\|^2, \]

as claimed. \(\square\)

The sequence of objective values is monotone and bounded from below by \(0\). Hence it converges. Note that the limit depends on the starting point.
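Because the limit depends on the initialization, a common remedy in practice is to restart the algorithm several times from independent random initializations and keep the run with the smallest final objective value. The sketch below assumes the functions opt_reps and opt_clust defined earlier are available; the helpers kmeans_with_cost and kmeans_restarts are our own additions (and, like kmeans, they ignore the possibility of empty clusters):

import numpy as np

def kmeans_with_cost(X, k, rng, maxiter=10):
    # same alternation as kmeans above, but also returns the final objective
    n, d = X.shape
    assign = rng.integers(0, k, n)
    for _ in range(maxiter):
        reps = opt_reps(X, k, assign)     # Step 1
        assign = opt_clust(X, k, reps)    # Step 2 (prints the objective)
    reps = opt_reps(X, k, assign)         # optimal centers for the final assignment
    G = np.sum((X - reps[assign, :]) ** 2)
    return assign, G

def kmeans_restarts(X, k, n_restarts=10, seed=535):
    # keep the clustering with the smallest final objective over several restarts
    rng = np.random.default_rng(seed)     # arbitrary seed, for reproducibility
    best_assign, best_G = None, np.inf
    for _ in range(n_restarts):
        assign, G = kmeans_with_cost(X, k, rng)
        if G < best_G:
            best_assign, best_G = assign, G
    return best_assign, best_G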

1.3.3. Penguins dataset#

We will test our implementation of \(k\)-means on the penguins dataset introduced earlier in the chapter.

Figure: Which penguin species? (Credit: Made with Midjourney)

Penguins on a glacier

\(\bowtie\)

We first extract the columns and combine them into a data matrix X. As we did previously, we also remove the rows with missing values.

df = pd.read_csv('penguins-measurements.csv')
df = df.dropna()
X = df[['bill_length_mm', 'bill_depth_mm', 
        'flipper_length_mm', 'body_mass_g']].to_numpy()

We visualize a two-dimensional slice of the data.

plt.scatter(X[:,1], X[:,3], s=10)
plt.xlabel('bill_depth_mm')
plt.ylabel('body_mass_g')
plt.show()

Observe that the features have quite different scales (tens versus thousands in the plot above). In such a case, it is common to standardize the data so that each feature has roughly the same scale. That is accomplished by, for each column of X, subtracting its empirical mean and dividing by its empirical standard deviation. We do this next.

mean = np.mean(X, axis=0)  # Compute mean for each column
std = np.std(X, axis=0)  # Compute standard deviation for each column
X = (X - mean) / std # Standardize each column

Now we run the \(k\)-means algorithm with \(k=2\) clusters.

assign = kmeans(X, 2)
1338.2046936914157
820.9361062178352
603.8787658966849
575.2587351391593
567.7837494880662
565.7076453796291
565.7076453796291
565.7076453796291
565.7076453796291
565.7076453796291

We visualize the output as we did before.

plt.scatter(X[:,1], X[:,3], c=assign, s=10)
plt.xlabel('bill_depth_mm (standardized)')
plt.ylabel('body_mass_g (standardized)')
plt.show()

This clustering looks quite good. Nevertheless, recall that:

  1. in this plot we are looking at only two of the four variables while \(k\)-means uses all of them,

  2. we are not guaranteed to find the best solution,

  3. our objective function is somewhat arbitrary, and

  4. it is not clear what the right choice of \(k\) is.

In fact, the original dataset provided the correct answer. Despite what the figure above may lead us to believe, there are in reality three separate species. So let’s try with \(k=3\) clusters.

assign = kmeans(X, 3)
1312.344945158482
577.1700837839458
428.50397345437966
392.2616692426171
383.3452894259011
380.76115947528945
379.50930154236835
379.3925027555175
379.3925027555175
379.3925027555175

The output does not seem quite right.

plt.scatter(X[:,1], X[:,3], c=assign, s=10)
plt.xlabel('bill_depth_mm (standardized)')
plt.ylabel('body_mass_g (standardized)')
plt.show()

But, remembering the warnings mentioned previously, let’s look at a different two-dimensional slice.

plt.scatter(X[:,0], X[:,3], c=assign, s=10)
plt.xlabel('bill_length_mm (standardized)')
plt.ylabel('body_mass_g (standardized)')
plt.show()

Let’s load up the truth and compare. We only keep those samples that were not removed because of missing values (see pandas.DataFrame.iloc).

df_truth = pd.read_csv('penguins-species.csv') 
df_truth = df_truth.iloc[df.index]
df_truth.head()
species
0 Adelie
1 Adelie
2 Adelie
4 Adelie
5 Adelie

The species are:

species = df_truth['species']
print(species.unique())
['Adelie' 'Chinstrap' 'Gentoo']

To plot the outcome, we color the species blue-green-red using a dictionary.

species2color = {'Adelie': 'b', 'Chinstrap': 'g', 'Gentoo': 'r'}
truth = species.replace(species2color)

Finally, we can compare the output to the truth. The match is excellent, though not perfect.

f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True)
ax1.scatter(X[:,0], X[:,3], c=truth, s=3)
ax1.set_title('truth')
ax2.scatter(X[:,0], X[:,3], c=assign, s=3)
ax2.set_title('kmeans')
plt.show()

It is informative to revisit the matrix of all pairwise scatter plots using seaborn.pairplot, this time coloring the points by species via its hue option. We first concatenate the dataframes with pandas.concat.

df_full = pd.concat([df, df_truth], axis=1)
df_full.head()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g species
0 39.1 18.7 181.0 3750.0 Adelie
1 39.5 17.4 186.0 3800.0 Adelie
2 40.3 18.0 195.0 3250.0 Adelie
4 36.7 19.3 193.0 3450.0 Adelie
5 39.3 20.6 190.0 3650.0 Adelie
import seaborn as sns
sns.pairplot(df_full, hue="species", diag_kind="hist", height=1.5)
plt.show()

Determining the appropriate number of clusters is not a straightforward problem. To quote Wikipedia:

The correct choice of \(k\) is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing \(k\) without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when \(k\) equals the number of data points, \(n\)). Intuitively then, the optimal choice of \(k\) will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of \(k\) is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.

In practice, several heuristics are in use. Other approaches to clustering, e.g., DBSCAN and hierarchical clustering, do not require the number of clusters as an input.
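One widely used heuristic is the so-called elbow method: run the algorithm for a range of values of \(k\), plot the final objective value against \(k\), and look for a “knee” past which the decrease slows down markedly. Here is a sketch, reusing the kmeans_with_cost helper introduced earlier (our own addition) on the standardized data matrix X:

import numpy as np
import matplotlib.pyplot as plt

ks = range(1, 8)
rng = np.random.default_rng(535)          # arbitrary seed
costs = [kmeans_with_cost(X, k, rng)[1] for k in ks]

plt.plot(list(ks), costs, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('final k-means objective')
plt.show()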

TRY IT! Run the analysis again, but this time without the standardization step. What do you observe? Is one feature more influential on the final output than the others? Why do you think that is? (Open in Colab)

TRY IT! Modify kmeans to take a tolerance tol as input and stop when the improvement in objective value G falls below the tolerance. (Open in Colab)

\(\unlhd\)

1.3.4. Matrix form of k-means clustering#

In this section, we show that the \(k\)-means clustering objective can be written in matrix form. We start with some notation and definitions that will be useful throughout.

As we indicated before, for a collection of \(n\) data vectors \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) in \(\mathbb{R}^d\), it is often convenient to stack them up into a matrix

\[\begin{split} X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix}. \end{split}\]

We can do the same with cluster representatives. Given \(\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k\) also in \(\mathbb{R}^d\), we form the matrix

\[\begin{split} U = \begin{bmatrix} \boldsymbol{\mu}_1^T \\ \boldsymbol{\mu}_2^T \\ \vdots \\ \boldsymbol{\mu}_k^T \\ \end{bmatrix} = \begin{bmatrix} \mu_{11} & \mu_{12} & \cdots & \mu_{1d} \\ \mu_{21} & \mu_{22} & \cdots & \mu_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{k1} & \mu_{k2} & \cdots & \mu_{kd} \\ \end{bmatrix}. \end{split}\]

Perhaps less obviously, cluster assignments can also be encoded in matrix form. Recall that, given a partition \(C_1,\ldots,C_k\) of \([n]\), we define \(c(j) = i\) if \(j \in C_i\). For \(j=1,\ldots,n\) and \(\ell=1,\ldots,k\), set \(Z_{j\ell} = 1\) if \(c(j) = \ell\) and \(0\) otherwise, and let \(Z\) be the \(n \times k\) matrix with entries \(Z = [Z_{j\ell}]_{j,\ell}\). That is, row \(j\) has exactly one entry with value \(1\), corresponding to the assigned cluster \(c(j)\) of data point \(\mathbf{x}_j\), and all other entries \(0\).

With this notation, the representative of the cluster assigned to data point \(\mathbf{x}_j\) is obtained through a matrix product

\[ \boldsymbol{\mu}_{c(j)}^T = \sum_{\ell = 1}^k Z_{j\ell} \boldsymbol{\mu}_{\ell}^T = \sum_{\ell = 1}^k Z_{j\ell} U_{\ell,\cdot} = \left(Z U\right)_{j,\cdot} \]

where we used that the \(j\)-th row of a matrix product is a linear combination of the rows of the second matrix, where the coefficients are the entries on the \(j\)-th row of the first one.

EXAMPLE: (continued) Continuing with our previous example, the clusters \( C_1 = \{1, 4, 6, 8\}, C_2 = \{2, 3, 7\}, C_3 = \{5\} \) are encoded as the matrix

\[\begin{split} Z = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{bmatrix}. \end{split}\]

Suppose again that the representatives are

\[\begin{split} \boldsymbol{\mu}_1 = \begin{pmatrix}-2\\1\end{pmatrix}, \boldsymbol{\mu}_2 = \begin{pmatrix}2\\-1\end{pmatrix}, \boldsymbol{\mu}_3 = \begin{pmatrix}-10\\10\end{pmatrix}. \end{split}\]

The corresponding matrix \(U\) is then

\[\begin{split} U = \begin{bmatrix} -2 & 1\\ 2 & -1\\ -10 & 10 \end{bmatrix}. \end{split}\]

Hence multiplying \(Z\) and \(U\) produces a matrix where each row is the representative of the assigned cluster of the corresponding data point

\[\begin{split} ZU = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{bmatrix} \, \begin{bmatrix} -2 & 1\\ 2 & -1\\ -10 & 10 \end{bmatrix} = \begin{bmatrix} -2 & 1\\ 2 & -1\\ 2 & -1\\ -2 & 1\\ -10 & 10\\ -2 & 1\\ 2 & -1\\ -2 & 1 \end{bmatrix}. \end{split}\]

\(\lhd\)
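In NumPy, the matrix \(Z\) can be built from the assignment vector by one-hot encoding; the following sketch reproduces the computation above:

import numpy as np

c = np.array([0, 1, 1, 0, 2, 0, 1, 0])    # cluster assignment c(j), 0-indexed
Z = np.zeros((8, 3))
Z[np.arange(8), c] = 1                    # one-hot encoding of the assignment

U = np.array([[-2., 1.], [2., -1.], [-10., 10.]])
print(Z @ U)   # row j is the representative of the cluster assigned to x_j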

Recall that the Frobenius norm of an \(n \times m\) matrix \(A \in \mathbb{R}^{n \times m}\) is defined as

\[ \|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m A_{ij}^2}. \]

Using the row notation, it can be written as the sum of the squared Euclidean norms of the rows

\[ \|A\|_F^2 = \sum_{i=1}^n \|A_{i,\cdot}\|_2^2. \]

For two matrices \(A, B \in \mathbb{R}^{n \times m}\), the Frobenius norm of their difference \(\|A - B\|_F\) can be interpreted as a distance between \(A\) and \(B\), that is, a measure of how dissimilar they are.

Finally, we return to the \(k\)-means objective. Using the notation introduced in this section and the equivalent formula for the objective \(G\) derived in the proof of the Optimal Clustering Lemma, we note that

\[\begin{align*} G(C_1,\ldots,C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) &= \sum_{i=1}^n \|\mathbf{x}_i - \boldsymbol{\mu}_{c(i)}\|^2\\ &= \sum_{i=1}^n \sum_{\ell=1}^d \left(x_{i\ell} - (Z U)_{i\ell}\right)^2\\ &= \|X - Z U \|^2_F, \end{align*}\]

where we used the definition of the Frobenius norm.

In other words, minimizing the \(k\)-means objective is equivalent to finding a matrix factorization of the form \(ZU\) that is a good fit to the data matrix \(X\) in Frobenius norm. This formulation expresses in a more compact form the idea of representing \(X\) as a combination of a small number of representatives. Matrix factorization will come back repeatedly in this course.
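As a sanity check (our own sketch), we can verify this identity numerically on the running example:

import numpy as np

X = np.array([[1., 0.], [-2., 0.], [-2., 1.], [1., -3.],
              [-10., 10.], [2., -2.], [-3., 1.], [3., -1.]])
c = np.array([0, 1, 1, 0, 2, 0, 1, 0])
Z = np.zeros((8, 3))
Z[np.arange(8), c] = 1
U = np.array([[-2., 1.], [2., -1.], [-10., 10.]])

frob_sq = np.linalg.norm(X - Z @ U, 'fro') ** 2                # matrix form
direct = np.sum(np.linalg.norm(X - U[c, :], axis=1) ** 2)      # sum of squared distances
print(np.isclose(frob_sq, direct))   # True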