1.3. Clustering: an objective, an algorithm and a guarantee#
Consider the following fundamental problem in data science.
The input: We are given $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$, thought of as $n$ data points in $d$ dimensions.
Our goal is to find a good clustering: loosely speaking, we want to partition the data points into groups such that points within a group are close to each other, while points in different groups are far apart.
Figure: Data points forming three clusters (Source)
The output: But first, we need to define precisely what we are trying to extract from the data. What is the mathematical structure of the solution sought? Fix a number of clusters $k$. Formally, we define a clustering of the data points as a partition of their indices $[n] = \{1, \ldots, n\}$.
DEFINITION (Partition) A partition of $[n] = \{1, \ldots, n\}$ of size $k$ is a collection of non-empty subsets $C_1, \ldots, C_k \subseteq [n]$ that:
are pairwise disjoint, i.e., $C_i \cap C_j = \emptyset$ whenever $i \neq j$;
cover all of $[n]$, i.e., $\bigcup_{i=1}^{k} C_i = [n]$.
EXAMPLE: Suppose we are given the following $n = 8$ data points in $d = 2$ dimensions: $\mathbf{x}_1 = (1, 0)$, $\mathbf{x}_2 = (-2, 0)$, $\mathbf{x}_3 = (-2, 1)$, $\mathbf{x}_4 = (1, -3)$, $\mathbf{x}_5 = (-10, 10)$, $\mathbf{x}_6 = (2, -2)$, $\mathbf{x}_7 = (-3, 1)$, $\mathbf{x}_8 = (3, -1)$.
So here $[n] = \{1, \ldots, 8\}$. With $k = 3$, one natural partition is $C_1 = \{1, 4, 6, 8\}$, $C_2 = \{2, 3, 7\}$, $C_3 = \{5\}$,
which corresponds to assigning data points $\mathbf{x}_1, \mathbf{x}_4, \mathbf{x}_6, \mathbf{x}_8$ to the first cluster, $\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_7$ to the second cluster, and $\mathbf{x}_5$ to the third cluster.
We number the clusters $1$ through $k$ for convenience only; the numbering itself carries no meaning.
1.3.1. The k-means objective#
Under the $k$-means objective, the cost of a partition $C_1, \ldots, C_k$ with cluster representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^d$ is defined as

$$
G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) = \sum_{i=1}^{k} \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2.
$$

Here $\|\cdot\|$ denotes the Euclidean norm, and $\boldsymbol{\mu}_i$ is the representative, or center, of cluster $C_i$; it need not be one of the data points.
Our goal is to find a partition $C_1, \ldots, C_k$ and representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$ minimizing $G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)$
over all partitions of $[n]$ of size $k$ and all choices of representatives in $\mathbb{R}^d$.
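For concreteness, here is one way the objective could be computed with NumPy; the function kmeans_objective and its signature are illustrative only (the implementation used later in this section appears below).

```python
import numpy as np

def kmeans_objective(X, assign, reps):
    # X: (n, d) array of data points (one per row)
    # assign: length-n integer array, assign[j] is the cluster index of point j
    # reps: (k, d) array of cluster representatives (one per row)
    diffs = X - reps[assign, :]      # difference between each point and its representative
    return np.sum(diffs ** 2)        # sum of squared Euclidean distances
```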
To quote Wikipedia:
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.
KNOWLEDGE CHECK: Is it possible for a global solution of the
In general, the problem is NP-hard; that is, no known algorithm is guaranteed to solve it efficiently in all cases. Instead, we will consider a popular heuristic, Lloyd's algorithm, which is based on the observation that the problem naturally breaks into two easier sub-problems:
finding the optimal representatives for a fixed partition;
finding the optimal partition for a fixed set of representatives.
One then alternates between the two (perhaps until progress falls below a tolerance). This is reasonable since our goal, as we pointed out above, is to solve the minimization problem

$$
\min_{C_1, \ldots, C_k} \; \min_{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k} \; \sum_{i=1}^{k} \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2,
$$

where the outer minimization is over all partitions $C_1, \ldots, C_k$ of $[n]$ of size $k$ and the inner minimization is over all choices of representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^d$.
CHAT & LEARN Ask your favorite AI chatbot about the differences between
Some useful optimization results To analyze Lloyd's algorithm, we will rely on a few basic observations.
Minimizing a quadratic function: To elaborate on the first step above, we review an elementary fact about quadratic functions. Consider the function

$$
q(x) = a x^2 + b x + c,
$$

where $a, b, c \in \mathbb{R}$.
When $a > 0$, the graph of $q$ is a parabola opening upwards, and $q$ has a unique global minimizer, given explicitly in the next lemma.
LEMMA (Minimum of a Quadratic Function) Let $q(x) = a x^2 + b x + c$ with $a > 0$. Then the unique global minimizer of $q$ is $x^* = -\frac{b}{2a}$.
Proof: By the First-Order Necessary Optimality Condition, a global minimizer of $q$ must satisfy

$$
q'(x) = 2 a x + b = 0,
$$

whose unique solution is $x^* = -\frac{b}{2a}$.
To see that $x^*$ is indeed a global minimizer, complete the square:

$$
q(x) = a \left( x + \frac{b}{2a} \right)^2 + \left( c - \frac{b^2}{4a} \right).
$$

Clearly, any other $x \neq x^*$ makes the first term, and hence $q(x)$, strictly larger, since $a > 0$. $\square$
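As a quick sanity check, we can compare the formula to a brute-force grid search on a hypothetical quadratic:

```python
import numpy as np

a, b, c = 2.0, -3.0, 1.0                  # hypothetical coefficients with a > 0
xs = np.linspace(-5, 5, 100001)           # dense grid of candidate minimizers
q = a * xs**2 + b * xs + c
print(xs[np.argmin(q)])                   # approximately 0.75
print(-b / (2 * a))                       # exact minimizer -b/(2a) = 0.75
```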
Optimizing an additively separable function: Functions that can be written as a sum of terms, each depending on its own disjoint set of coordinates, arise commonly in optimization and have convenient “separability” properties.
For vectors $\mathbf{x}_1 \in \mathbb{R}^{d_1}, \ldots, \mathbf{x}_m \in \mathbb{R}^{d_m}$, we write $(\mathbf{x}_1, \ldots, \mathbf{x}_m)$ for the vector in $\mathbb{R}^{d_1 + \cdots + d_m}$ obtained by concatenating them.
LEMMA (Optimizing a Separable Function) Suppose the function $f : \mathbb{R}^{d_1 + \cdots + d_m} \to \mathbb{R}$ can be written as

$$
f(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \sum_{i=1}^{m} f_i(\mathbf{x}_i)
$$

for a collection of functions $f_i : \mathbb{R}^{d_i} \to \mathbb{R}$, $i = 1, \ldots, m$, each of which has a global minimizer $\mathbf{x}_i^*$. Then $(\mathbf{x}_1^*, \ldots, \mathbf{x}_m^*)$ is a global minimizer of $f$, and

$$
\min_{\mathbf{x}_1, \ldots, \mathbf{x}_m} f(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \sum_{i=1}^{m} \min_{\mathbf{x}_i} f_i(\mathbf{x}_i).
$$
Proof idea: Each term in the sum defining $f$ depends on its own set of variables, so each term can be minimized separately.
Proof: Let $\mathbf{x}_i^*$ be a global minimizer of $f_i$ for each $i$, and let $(\mathbf{x}_1, \ldots, \mathbf{x}_m)$ be an arbitrary point.
Since $f_i(\mathbf{x}_i^*) \le f_i(\mathbf{x}_i)$ for every $i$, summing over $i$ gives

$$
f(\mathbf{x}_1^*, \ldots, \mathbf{x}_m^*) = \sum_{i=1}^{m} f_i(\mathbf{x}_i^*) \le \sum_{i=1}^{m} f_i(\mathbf{x}_i) = f(\mathbf{x}_1, \ldots, \mathbf{x}_m),
$$

which proves the claim. $\square$
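A small illustration with two hypothetical one-variable terms: minimizing the sum jointly over a grid agrees with summing the separate minima.

```python
import numpy as np

f1 = lambda x: (x - 1.0) ** 2        # first term, depends only on x1
f2 = lambda x: np.abs(x + 2.0)       # second term, depends only on x2

xs = np.linspace(-5, 5, 1001)
X1, X2 = np.meshgrid(xs, xs)
joint_min = np.min(f1(X1) + f2(X2))              # minimize the sum over the grid
separate_min = np.min(f1(xs)) + np.min(f2(xs))   # sum of the separate minima
print(joint_min, separate_min)                   # both are (approximately) 0.0
```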
Composing with a non-decreasing function: Recall that a real-valued function $g : \mathbb{R} \to \mathbb{R}$ is non-decreasing if $x \le y$ implies $g(x) \le g(y)$.
LEMMA (Composing with a Non-Decreasing Function) Let $f : \mathbb{R}^d \to \mathbb{R}$ and let $g : \mathbb{R} \to \mathbb{R}$ be non-decreasing. If $\mathbf{x}^*$ is a global minimizer of $f$, then it is also a global minimizer of the composition $g \circ f$, defined by $(g \circ f)(\mathbf{x}) = g(f(\mathbf{x}))$.
Proof idea: This just follows from applying the definitions.
Proof: Let $\mathbf{x}^*$ be a global minimizer of $f$ and let $\mathbf{x} \in \mathbb{R}^d$ be arbitrary.
Since $f(\mathbf{x}^*) \le f(\mathbf{x})$ and $g$ is non-decreasing, we get $g(f(\mathbf{x}^*)) \le g(f(\mathbf{x}))$, so $\mathbf{x}^*$ is also a global minimizer of $g \circ f$. $\square$
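This will be used below with squaring, which is non-decreasing on the non-negative numbers: minimizing a distance or its square gives the same minimizer. A tiny illustration:

```python
import numpy as np

dists = np.array([3.2, 1.7, 4.5])                # non-negative distances to three candidates
print(np.argmin(dists), np.argmin(dists ** 2))   # same index (1) either way
```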
Sub-problem 1: finding the optimal representatives We first fix the partition $C_1, \ldots, C_k$ and seek the best representatives for it. We denote by $|C_i|$ the size of cluster $C_i$, that is, the number of data points assigned to it.
EXAMPLE: (continued) Continuing the example above, the sizes of the clusters are respectively $|C_1| = 4$, $|C_2| = 3$, and $|C_3| = 1$.
LEMMA (Optimal Representatives) Fix a partition $C_1, \ldots, C_k$ of $[n]$. The representatives minimizing $G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)$ over all choices of $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^d$
are the centroids

$$
\boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j \in C_i} \mathbf{x}_j, \qquad i = 1, \ldots, k.
$$

Proof idea: The objective decomposes, cluster by cluster and coordinate by coordinate, into a sum of quadratic functions, each of which is minimized at the corresponding coordinate of the centroid.
EXAMPLE: (continued) Continuing with the previous example, we compute the optimal representatives for the fixed partition $C_1 = \{1, 4, 6, 8\}$, $C_2 = \{2, 3, 7\}$, $C_3 = \{5\}$:

$$
\boldsymbol{\mu}_1^* = \tfrac{1}{4}\left[(1,0) + (1,-3) + (2,-2) + (3,-1)\right] = (7/4, -3/2),
$$

$$
\boldsymbol{\mu}_2^* = \tfrac{1}{3}\left[(-2,0) + (-2,1) + (-3,1)\right] = (-7/3, 2/3),
$$

$$
\boldsymbol{\mu}_3^* = (-10, 10).
$$
Proof: (Optimal Representatives) Using the notation $\mathbf{x}_j = (x_{j1}, \ldots, x_{jd})$ and $\boldsymbol{\mu}_i = (\mu_{i1}, \ldots, \mu_{id})$, we expand the objective over the coordinates:

$$
G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)
= \sum_{i=1}^{k} \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2
= \sum_{i=1}^{k} \sum_{l=1}^{d} \left[ \sum_{j \in C_i} (x_{jl} - \mu_{il})^2 \right].
$$

The expression in square brackets is a quadratic function in $\mu_{il}$:

$$
\sum_{j \in C_i} (x_{jl} - \mu_{il})^2
= |C_i| \, \mu_{il}^2 - 2 \left( \sum_{j \in C_i} x_{jl} \right) \mu_{il} + \sum_{j \in C_i} x_{jl}^2.
$$

Therefore, by the formula for the Minimum of a Quadratic Function, it is minimized at

$$
\mu_{il}^* = \frac{2 \sum_{j \in C_i} x_{jl}}{2 |C_i|} = \frac{1}{|C_i|} \sum_{j \in C_i} x_{jl}.
$$

Since each term in square brackets involves its own variable $\mu_{il}$, the Optimizing a Separable Function Lemma implies that $G$ is minimized by making this choice for every $i$ and $l$ simultaneously, that is, by taking each $\boldsymbol{\mu}_i^*$ to be the centroid of cluster $C_i$. $\square$
The fact that the squared Euclidean norm decomposes into a sum over the coordinates (which the norm itself does not, because of the square root) is one reason it is convenient to work with here, as was perhaps apparent in this last proof.
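The Optimal Representatives Lemma can also be checked numerically; in the sketch below, the cost of a small hypothetical cluster is never decreased by moving away from its centroid.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))               # a small hypothetical cluster of 5 points in R^2
centroid = np.mean(pts, axis=0)             # the claimed optimal representative

def cluster_cost(mu):
    # sum of squared distances from the cluster's points to a candidate representative mu
    return np.sum((pts - mu) ** 2)

for shift in [np.array([0.1, -0.2]), np.array([1.0, 0.0]), -centroid]:
    print(cluster_cost(centroid) <= cluster_cost(centroid + shift))   # always True
```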
Sub-problem 2: finding the optimal partition Given fixed representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^d$, we now seek a partition $C_1, \ldots, C_k$ of $[n]$ minimizing the objective $G$. For a partition $C_1, \ldots, C_k$, we write $c(j) = i$ to indicate that data point $j$ belongs to cluster $C_i$.
EXAMPLE: (continued) Continuing the example above, the clusters $C_1 = \{1, 4, 6, 8\}$, $C_2 = \{2, 3, 7\}$, $C_3 = \{5\}$ correspond to the assignment $c(1) = c(4) = c(6) = c(8) = 1$, $c(2) = c(3) = c(7) = 2$, and $c(5) = 3$.
LEMMA (Optimal Clustering) Fix representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^d$. A partition minimizing $G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)$ over all partitions of $[n]$ of size $k$
is obtained as follows. For each $j \in [n]$, assign $j$ to a cluster $C_i$ such that

$$
\|\mathbf{x}_j - \boldsymbol{\mu}_i\| \le \|\mathbf{x}_j - \boldsymbol{\mu}_{i'}\| \qquad \text{for all } i' \in [k],
$$

that is, assign each data point to (one of) the closest representative(s), breaking ties arbitrarily.
Proof: If $c(j)$ denotes the index of the cluster to which data point $j$ is assigned, the objective can be rewritten as

$$
G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k) = \sum_{j=1}^{n} \|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|^2,
$$

where each term depends only on the assignment of the corresponding data point. Hence the sum is minimized by minimizing each term separately.
By definition, when the $j$-th data point is assigned to (one of) its closest representative(s), the distance $\|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|$ is as small as possible; since squaring is non-decreasing on the non-negative numbers, the Composing with a Non-Decreasing Function Lemma implies that $\|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|^2$ is also as small as possible. $\square$
EXAMPLE: (continued) Continuing the example above, suppose that we choose as representatives the centroids computed previously, namely $\boldsymbol{\mu}_1 = (7/4, -3/2)$, $\boldsymbol{\mu}_2 = (-7/3, 2/3)$, and $\boldsymbol{\mu}_3 = (-10, 10)$.
Then we find the cluster assignment of $\mathbf{x}_1 = (1, 0)$ by comparing its distances to the three representatives:

$$
\|\mathbf{x}_1 - \boldsymbol{\mu}_1\| = \sqrt{(3/4)^2 + (3/2)^2} \approx 1.68, \quad
\|\mathbf{x}_1 - \boldsymbol{\mu}_2\| = \sqrt{(10/3)^2 + (2/3)^2} \approx 3.40, \quad
\|\mathbf{x}_1 - \boldsymbol{\mu}_3\| = \sqrt{11^2 + 10^2} \approx 14.87.
$$

The minimum is achieved for $i = 1$, so $\mathbf{x}_1$ is assigned to the first cluster, i.e., $c(1) = 1$.
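The same computation can be done with NumPy; the snippet below reuses the numbers from the example.

```python
import numpy as np
from numpy import linalg as LA

x1 = np.array([1., 0.])                                        # data point x_1
reps = np.array([[7/4, -3/2], [-7/3, 2/3], [-10., 10.]])       # the three representatives above
dists = np.array([LA.norm(x1 - reps[i, :]) for i in range(3)])
print(dists)               # approximately [1.68, 3.40, 14.87]
print(np.argmin(dists))    # 0, i.e., x_1 is assigned to the first cluster
```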
1.3.2. Lloyd’s algorithm and its analysis#
We are now ready to describe Lloyd's algorithm and to implement it in Python. Starting from an arbitrary cluster assignment, it alternately applies the two lemmas above: it computes the optimal representatives for the current partition, then the optimal partition for those representatives, and repeats.
The input X is assumed to be a collection of $n$ data points in $\mathbb{R}^d$, stacked as the rows of an $n \times d$ array. The input k is the desired number of clusters. There is an optional input maxiter for the maximum number of iterations, which is set to 5 by default.
We first define separate functions for the two main steps. To find the index of the minimum entry of an array, we use the function numpy.argmin. We also use numpy.linalg.norm to compute the Euclidean distance.
def opt_reps(X, k, assign):
    (n, d) = X.shape
    reps = np.zeros((k, d))
    for i in range(k):
        # indices of the data points currently assigned to cluster i
        in_i = [j for j in range(n) if assign[j] == i]
        # optimal representative of cluster i: the centroid of its points
        reps[i,:] = np.sum(X[in_i,:],axis=0) / len(in_i)
    return reps
def opt_clust(X, k, reps):
    (n, d) = X.shape
    dist = np.zeros(n)
    assign = np.zeros(n, dtype=int)
    for j in range(n):
        # distances from data point j to each of the k representatives
        dist_to_i = np.array([LA.norm(X[j,:] - reps[i,:]) for i in range(k)])
        # optimal assignment: the closest representative
        assign[j] = np.argmin(dist_to_i)
        dist[j] = dist_to_i[assign[j]]
    # current value of the k-means objective, printed to track progress
    G = np.sum(dist ** 2)
    print(G)
    return assign
The main function follows. Below, rng.integers(0,k,n) is an array of n uniformly chosen integers between 0 and k-1 (inclusive). See random.Generator.integers for details. Recall that throughout, when defining a function that uses a random number generator (RNG), we initialize the RNG outside the function and pass it in. This allows us to maintain control over the random number generation process at a higher level and ensures consistent results across multiple runs.
def kmeans(rng, X, k, maxiter=5):
    (n, d) = X.shape
    # start from a uniformly random cluster assignment
    assign = rng.integers(0,k,n)
    reps = np.zeros((k, d), dtype=int)
    for iter in range(maxiter):
        # Step 1: optimal representatives for the current assignment
        reps = opt_reps(X, k, assign)
        # Step 2: optimal assignment for the current representatives
        assign = opt_clust(X, k, reps)
    return assign
NUMERICAL CORNER: We apply our implementation of $k$-means clustering with $k = 3$ to the data points from the example above. The input matrix X is the following:
seed = 535
rng = np.random.default_rng(seed)
X = np.array([[1., 0.],[-2., 0.],[-2.,1.],[1.,-3.],
[-10.,10.],[2.,-2.],[-3.,1.],[3.,-1.]])
assign = kmeans(rng, X, 3)
162.7
74.8611111111111
9.083333333333334
9.083333333333334
9.083333333333334
We visualize the output by coloring the points according to their cluster assignment.
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-11,4,-4,11])
plt.show()

We can compute the final representatives (optimal for the final assignment) by using the subroutine opt_reps.
print(opt_reps(X, 3, assign))
[[ -2.33333333 0.66666667]
[ 1.75 -1.5 ]
[-10. 10. ]]
Each row is the center of the corresponding cluster. Note that these match the ones we previously computed. Indeed, the clustering is the same (although the clusters are not necessarily numbered in the same order).
TRY IT! Modify kmeans to take a tolerance tol as input and stop when the improvement in objective value G falls below the tolerance. (Open in Colab)
KNOWLEDGE CHECK: Suppose we have infinite computational resources and run Lloyd's algorithm forever. Do you think it will necessarily converge to a global minimum? If your answer is no, can you think of an alternative algorithm that is guaranteed to output a global minimum provided enough computational resources?
Lloyd's algorithm is only a heuristic. In particular, it is not guaranteed to find the global minimum of the $k$-means objective. It does, however, satisfy the following convergence guarantee.
THEOREM (Convergence of $k$-means) The sequence of objective function values produced by Lloyd's algorithm is non-increasing.
Proof idea: By the Optimal Representatives Lemma and the Optimal Clustering Lemma, each step does not increase the objective.
Proof: Let $C_1, \ldots, C_k$ be the clusters and $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$ the representatives at the start of an iteration. After Step 1, the new representatives $\boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_k'$ are the centroids of the clusters $C_1, \ldots, C_k$, so by the Optimal Representatives Lemma

$$
G(C_1, \ldots, C_k; \boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_k') \le G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k).
$$

After Step 2, the new clusters are $C_1', \ldots, C_k'$, obtained by assigning each data point to its closest representative among $\boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_k'$, so by the Optimal Clustering Lemma

$$
G(C_1', \ldots, C_k'; \boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_k') \le G(C_1, \ldots, C_k; \boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_k').
$$

Combining these two inequalities gives

$$
G(C_1', \ldots, C_k'; \boldsymbol{\mu}_1', \ldots, \boldsymbol{\mu}_k') \le G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k),
$$

as claimed. $\square$
The sequence of objective values is monotone non-increasing and bounded from below by $0$; hence it converges, although not necessarily to the global minimum of the objective.
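Because the guarantee concerns only monotonicity, different random initializations can lead to different final clusterings. A quick experiment along these lines, reusing kmeans and the small matrix X from the Numerical Corner above:

```python
for seed in [0, 1, 2]:
    rng = np.random.default_rng(seed)
    assign = kmeans(rng, X, 3)   # the final printed value of G may differ across seeds
```

In practice, one often runs the algorithm several times and keeps the run with the lowest final objective value.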
CHAT & LEARN AI chatbots can serve as great personal tutors, especially when it comes to coding which they often excel at. In particular, they can provide additional information about the code in this book. Just copy-paste a piece of code and ask “What is this code doing?” Don’t hesitate to ask follow-up questions. Here is an example using ChatGPT.
Warning: As you probably know, AI chatbots can be wrong so assess what they tell you with a critical mind and/or double-check with other sources (e.g., package documentation).
NUMERICAL CORNER: We will test our implementation of $k$-means clustering on the penguins dataset encountered previously. We first load the measurements into a data matrix X. As we did previously, we also remove the rows with missing values.
Figure: Which penguin species? (Credit: Made with Midjourney)
data = pd.read_csv('penguins-measurements.csv')
data = data.dropna()
X = data[['bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g']].to_numpy()
We visualize a two-dimensional slice of the data.
plt.scatter(X[:,1], X[:,3], s=5, c='k')
plt.xlabel('bill_depth_mm'), plt.ylabel('body_mass_g')
plt.show()

Observe that the features have quite different scales (tens versus thousands in the plot above). In such a case, it is common to standardize the data so that each feature has roughly the same scale. For each column of X
, we subtract its empirical mean and divide by its empirical standard deviation.
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = (X - mean) / std
Now we run Lloyd's algorithm with $k = 2$ clusters.
assign = kmeans(rng, X, 2)
1338.2046936914157
820.9361062178352
603.8787658966849
575.2587351391593
567.7837494880662
We visualize the output as we did before, but this time coloring the data points by their cluster assignment.
plt.scatter(X[:,1], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_depth (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()

This clustering looks quite good. Nevertheless, recall that: in this plot we are looking at only two of the four variables, while $k$-means uses all of them; we are not guaranteed to find the best solution; our objective function is somewhat arbitrary; and it is not clear what the right choice of $k$ is.
In fact, the original dataset contained the correct answer, as provided by biologists. Despite what the figure above may lead us to believe, there are in reality three separate species. So let us try with $k = 3$ clusters instead.
assign = kmeans(rng, X, 3)
1312.344945158482
577.1700837839458
428.50397345437966
392.2616692426171
383.3452894259011
The output does not seem quite right.
plt.scatter(X[:,1], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_depth (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()

But, remembering the warnings mentioned previously, let us look at a different two-dimensional slice.
plt.scatter(X[:,0], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_length (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()

Let us load up the truth and compare. We only keep those samples that were not removed because of missing values (see pandas.DataFrame.iloc).
data_truth = pd.read_csv('penguins-species.csv')
data_truth = data_truth.iloc[data.index]
data_truth.head()
|   | species |
|---|---------|
| 0 | Adelie |
| 1 | Adelie |
| 2 | Adelie |
| 4 | Adelie |
| 5 | Adelie |
The species are:
species = data_truth['species']
print(species.unique())
['Adelie' 'Chinstrap' 'Gentoo']
To plot the outcome, we color the species blue-green-red using a dictionary.
species2color_dict = {'Adelie': 'blue', 'Chinstrap': 'lime', 'Gentoo': 'red'}
truth = [species2color_dict[a] for a in species]
Finally, we can compare the output to the truth. The match is quite good – but certainly not perfect.
f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(6.5, 3))
ax1.scatter(X[:,0], X[:,3], s=5, c=truth)
ax1.set_title('truth')
ax2.scatter(X[:,0], X[:,3], s=5, c=assign, cmap='brg')
ax2.set_title('kmeans')
plt.show()

Determining the appropriate number of clusters is not a straightforward problem. To quote Wikipedia:
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.
In practice, several heuristics are in use. Other approaches to clustering, e.g. DBSCAN and hierarchical clustering, do not require a number of clusters as input.
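One such heuristic, often called the elbow method, runs the algorithm for a range of values of $k$ and looks for a value beyond which the objective stops decreasing sharply. Here is a minimal sketch for the standardized penguins data; the helper kmeans_obj is illustrative, not from the text, and, like our simple implementation, it assumes that no cluster ends up empty.

```python
def kmeans_obj(X, k, assign):
    # objective value of a given assignment, using the optimal (centroid) representatives
    reps = opt_reps(X, k, assign)
    return np.sum((X - reps[assign, :]) ** 2)

rng = np.random.default_rng(535)
objs = []
for k in range(1, 7):
    assign = kmeans(rng, X, k)            # also prints G at each iteration
    objs.append(kmeans_obj(X, k, assign))

plt.plot(range(1, 7), objs, marker='o')
plt.xlabel('k'), plt.ylabel('final objective')
plt.show()
```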
TRY IT! Run the analysis again, but this time without the standardization step. What do you observe? Is one feature more influential on the final output than the others? Why do you think that is? (Open in Colab)
1.3.3. Matrix form of k-means clustering#
In this section, we show that the $k$-means objective can be written compactly in matrix form.
As we indicated before, for a collection of $n$ data points $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ we form the data matrix $X \in \mathbb{R}^{n \times d}$ whose $j$-th row is $\mathbf{x}_j^T$.
We can do the same with cluster representatives. Given representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k \in \mathbb{R}^d$, we form the matrix $M \in \mathbb{R}^{k \times d}$ whose $i$-th row is $\boldsymbol{\mu}_i^T$.
Perhaps less obviously, cluster assignments can also be encoded in matrix form. Recall that, given a partition $C_1, \ldots, C_k$ of $[n]$, we write $c(j)$ for the index of the cluster containing data point $j$. Define the matrix $Z \in \{0, 1\}^{n \times k}$ by

$$
Z_{ji} =
\begin{cases}
1 & \text{if } i = c(j),\\
0 & \text{otherwise,}
\end{cases}
$$

so that each row of $Z$ has exactly one nonzero entry, indicating the cluster assignment of the corresponding data point.
With this notation, the representative of the cluster assigned to data point $j$ is

$$
\boldsymbol{\mu}_{c(j)}^T = \sum_{i=1}^{k} Z_{ji} \, \boldsymbol{\mu}_i^T = (Z M)_{j, \cdot},
$$

where we used that the $j$-th row of $Z$ has a single nonzero entry, a one in column $c(j)$.
EXAMPLE: (continued) Continuing with our previous example, the clusters $C_1 = \{1, 4, 6, 8\}$, $C_2 = \{2, 3, 7\}$, $C_3 = \{5\}$ correspond to the assignment matrix

$$
Z = \begin{pmatrix}
1 & 0 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
1 & 0 & 0\\
0 & 0 & 1\\
1 & 0 & 0\\
0 & 1 & 0\\
1 & 0 & 0
\end{pmatrix}.
$$

Suppose again that the representatives are $\boldsymbol{\mu}_1 = (7/4, -3/2)$, $\boldsymbol{\mu}_2 = (-7/3, 2/3)$, and $\boldsymbol{\mu}_3 = (-10, 10)$.
The corresponding matrix of representatives is

$$
M = \begin{pmatrix}
7/4 & -3/2\\
-7/3 & 2/3\\
-10 & 10
\end{pmatrix}.
$$

Hence multiplying $Z$ by $M$ stacks, row by row, the representative assigned to each data point:

$$
Z M = \begin{pmatrix}
7/4 & -3/2\\
-7/3 & 2/3\\
-7/3 & 2/3\\
7/4 & -3/2\\
-10 & 10\\
7/4 & -3/2\\
-7/3 & 2/3\\
7/4 & -3/2
\end{pmatrix}.
$$
Recall that the Frobenius norm of an $n \times d$ matrix $A \in \mathbb{R}^{n \times d}$ is defined as

$$
\|A\|_F = \sqrt{\sum_{j=1}^{n} \sum_{l=1}^{d} A_{jl}^2}.
$$

Using the row notation $A_{j,\cdot}$ for the $j$-th row of $A$, the squared Frobenius norm can be written as the sum of the squared Euclidean norms of the rows:

$$
\|A\|_F^2 = \sum_{j=1}^{n} \|A_{j,\cdot}\|^2.
$$

For two matrices $A, B \in \mathbb{R}^{n \times d}$, it follows that

$$
\|A - B\|_F^2 = \sum_{j=1}^{n} \|A_{j,\cdot} - B_{j,\cdot}\|^2.
$$
Finally, we return to the $k$-means objective. With the notation above,

$$
G(C_1, \ldots, C_k; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k)
= \sum_{j=1}^{n} \|\mathbf{x}_j - \boldsymbol{\mu}_{c(j)}\|^2
= \sum_{j=1}^{n} \|X_{j,\cdot} - (Z M)_{j,\cdot}\|^2
= \|X - Z M\|_F^2,
$$

where we used the definition of the Frobenius norm.
In other words, minimizing the $k$-means objective is equivalent to the matrix optimization problem

$$
\min_{Z, M} \|X - Z M\|_F^2,
$$

where the minimum is over all cluster-assignment matrices $Z \in \{0, 1\}^{n \times k}$ with exactly one nonzero entry per row and over all matrices $M \in \mathbb{R}^{k \times d}$.
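As a quick sanity check, we can verify on small random matrices that the matrix form of the objective agrees with the point-by-point sum; the names below are illustrative.

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(1)
n, d, k = 6, 2, 3
Xs = rng.normal(size=(n, d))        # hypothetical data matrix (rows are data points)
M = rng.normal(size=(k, d))         # hypothetical representatives (rows)
c = rng.integers(0, k, n)           # hypothetical cluster assignment c(j)

Z = np.zeros((n, k))
Z[np.arange(n), c] = 1              # one nonzero entry per row, in column c(j)

pointwise = sum(LA.norm(Xs[j, :] - M[c[j], :]) ** 2 for j in range(n))
matrix_form = LA.norm(Xs - Z @ M) ** 2      # squared Frobenius norm
print(np.isclose(pointwise, matrix_form))   # True
```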
Self-assessment quiz (with help from Claude, Gemini, and ChatGPT)
1 Which of these is NOT a property of a valid partition $C_1, \ldots, C_k$ of $[n]$?
a) The subsets are pairwise disjoint
b) The subsets cover all data points
c) Each subset is non-empty
d) Each subset contains an equal number of points
2 In the $k$-means objective $\sum_{i=1}^{k} \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$, what does $\boldsymbol{\mu}_i$ represent?
a) The centroid of cluster $i$
b) The number of points in cluster $i$
c) The distance between clusters
d) The assignment of point $j$ to a cluster
3 The $k$-means objective function measures which of the following?
a) The total number of clusters.
b) The average distance between data points.
c) The sum of squared distances between each data point and its assigned cluster center.
d) The maximum distance between any two cluster centers.
4 What is a key property of the sequence of objective function values produced by the $k$-means (Lloyd's) algorithm?
a) It is strictly decreasing
b) It is non-increasing
c) It is strictly increasing
d) It alternates between two values
5 What is the interpretation of the matrix $Z$ in the matrix form of the $k$-means objective?
a) It represents the cluster centers.
b) It represents the distances between data points.
c) It encodes the cluster assignments of each data point.
d) It represents the covariance matrix of the data.
Answer for 1: d. Justification: “Formally, we define a clustering as a partition. A partition of $[n]$ of size $k$ is a collection of non-empty subsets $C_1, \ldots, C_k \subseteq [n]$ that are pairwise disjoint and cover all of $[n]$.” Nothing in the definition requires the subsets to have equal sizes.
Answer for 2: a. Justification: “Here $\boldsymbol{\mu}_i$ is the representative, or center, of cluster $C_i$”; by the Optimal Representatives Lemma, the optimal representative of a cluster is its centroid.
Answer for 3: c. Justification: The $k$-means objective is $\sum_{i=1}^{k} \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$, the sum of squared distances between each data point and its assigned cluster center.
Answer for 4: b. Justification: “The sequence of objective function values produced by Lloyd's algorithm is non-increasing.” (Convergence of $k$-means Theorem)
Answer for 5: c. Justification: The text defines $Z$ as an $n \times k$ matrix with exactly one nonzero entry per row, a one in the column corresponding to the cluster to which that data point is assigned.