# Selection of K in K-means Clustering, Reloaded

This article follows up on the series devoted to k-means clustering at The Data Science Lab. Previous posts have dealt with how to implement Lloyd’s algorithm for clustering in python, described an improved initialization algorithm for proper seeding of the initial clusters, k-means++, and introduced the gap statistic as a method of finding the optimal K for k-means clustering.

Although the gap statistic, based on a paper by Tibshirani et al was shown to find optimal values for the number of clusters in a variety of cases when the clusters where globular and mildly disjointed, its performance might be hampered by the need of perfoming Monte Carlo simulations to estimate the reference datasets. A reader of this blog, Jonathan Stray, pointed out a potentially superior method for selecting the K in k-means clustering, so let us implement it and compare.

### An alternative approach to finding the optimal K

The approach suggested by our reader is based on a publication by Pham, Dimov and Nguyen from 2004. The article is very much worth reading, as it includes an explanation of the drawbacks of the standard k-means algorithm as well as a comprehensive survey on different methods that have been proposed for selecting an optimal number of clusters.

In section 3 of the paper, the authors justify the introduction of a function $f(K)$ to evaluate the quality of the resulting clustering and help decide on the optimal value of $K$ for each data set. Quoting from the paper:

A data set with $n$ objects could be grouped into any number of clusters between 1 and $n$, which would correspond to the lowest and the highest levels of detail respectively. By specifying different $K$ values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one $K$ value could be recommended to users, but the ﬁnal selection is made by them.

The goal of a clustering algorithm is to identify regions in which the data points are concentrated. It is also important to analyze the internal distribution of each cluster as well as its relation to other clusters in the data set. The distorsion of a cluster is a measure of the distance between points in a cluster and its centroid:

$\displaystyle I_j = \sum_{\mathrm{x}_i \in C_j} ||\mathrm{x}_i - \mu_j ||^2$.

The global impact of all clusters’ distortions is given by the quantity

$\displaystyle S_k = \sum_{j=1}^K I_j$.

The authors Pham et al. proceed to discuss further constrains that the sought-after function $f(K)$ should verify for it to be informative to the problem of selection of K. They finally arrive at the following definition:

$N_d$ is the number of dimensions (attributes) of the data set and $\alpha_K$ is a weight factor. With this definition, $f(K)$ is the ratio of the real distortion to the estimated distortion and decreases when there are areas of concentration in the data distribution. Values of $K$ that yield small $f(K)$ can be regarded as giving well-deﬁned clusters.

### A python implementation of Pham et al. f(K)

Our implementation of the Pham et al. procedure builds on the KMeans and KPlusPlus python classes defined in our article on the k-means++ algorithm. We define a new class that inherits from KPlusPlus and contains a function to compute $f(K)$:

class DetK(KPlusPlus):
def fK(self, thisk, Skm1=0):
X = self.X
Nd = len(X[0])
a = lambda k, Nd: 1 - 3/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6
self.find_centers(thisk, method='++')
mu, clusters = self.mu, self.clusters
Sk = sum([np.linalg.norm(mu[i]-c)**2 \
for i in range(thisk) for c in clusters[i]])
if thisk == 1:
fs = 1
elif Skm1 == 0:
fs = 1
else:
fs = Sk/(a(thisk,Nd)*Skm1)
return fs, Sk


Note the recursive definition of $\alpha_K$ (variable a in the code snapshot above) and the fact that the computation of $S_K$ for $K > 1$ requires knowing the value of $S_{K-1}$, which is passed as input parameter to the function.

This article aims at showing that the Pham et al. procedure works and is computationally more efficient than the gap statistic. Therefore, we will code up the algorithm for the gap statistic within the same class DetK, so that we can run both procedures simultaneously. The full code is below the fold:

class DetK(KPlusPlus):
def fK(self, thisk, Skm1=0):
X = self.X
Nd = len(X[0])
a = lambda k, Nd: 1 - 3/(4*Nd) if k == 2 else a(k-1, Nd) + (1-a(k-1, Nd))/6
self.find_centers(thisk, method='++')
mu, clusters = self.mu, self.clusters
Sk = sum([np.linalg.norm(mu[i]-c)**2 \
for i in range(thisk) for c in clusters[i]])
if thisk == 1:
fs = 1
elif Skm1 == 0:
fs = 1
else:
fs = Sk/(a(thisk,Nd)*Skm1)
return fs, Sk

def _bounding_box(self):
X = self.X
xmin, xmax = min(X,key=lambda a:a[0])[0], max(X,key=lambda a:a[0])[0]
ymin, ymax = min(X,key=lambda a:a[1])[1], max(X,key=lambda a:a[1])[1]
return (xmin,xmax), (ymin,ymax)

def gap(self, thisk):
X = self.X
(xmin,xmax), (ymin,ymax) = self._bounding_box()
self.init_centers(thisk)
self.find_centers(thisk, method='++')
mu, clusters = self.mu, self.clusters
Wk = np.log(sum([np.linalg.norm(mu[i]-c)**2/(2*len(c)) \
for i in range(thisk) for c in clusters[i]]))
# Create B reference datasets
B = 10
BWkbs = zeros(B)
for i in range(B):
Xb = []
for n in range(len(X)):
Xb.append([random.uniform(xmin,xmax), \
random.uniform(ymin,ymax)])
Xb = np.array(Xb)
kb = DetK(thisk, X=Xb)
kb.init_centers(thisk)
kb.find_centers(thisk, method='++')
ms, cs = kb.mu, kb.clusters
BWkbs[i] = np.log(sum([np.linalg.norm(ms[j]-c)**2/(2*len(c)) \
for j in range(thisk) for c in cs[j]]))
Wkb = sum(BWkbs)/B
sk = np.sqrt(sum((BWkbs-Wkb)**2)/float(B))*np.sqrt(1+1/B)
return Wk, Wkb, sk

def run(self, maxk, which='both'):
ks = range(1,maxk)
fs = zeros(len(ks))
Wks,Wkbs,sks = zeros(len(ks)+1),zeros(len(ks)+1),zeros(len(ks)+1)
# Special case K=1
self.init_centers(1)
if which == 'f':
fs[0], Sk = self.fK(1)
elif which == 'gap':
Wks[0], Wkbs[0], sks[0] = self.gap(1)
else:
fs[0], Sk = self.fK(1)
Wks[0], Wkbs[0], sks[0] = self.gap(1)
# Rest of Ks
for k in ks[1:]:
self.init_centers(k)
if which == 'f':
fs[k-1], Sk = self.fK(k, Skm1=Sk)
elif which == 'gap':
Wks[k-1], Wkbs[k-1], sks[k-1] = self.gap(k)
else:
fs[k-1], Sk = self.fK(k, Skm1=Sk)
Wks[k-1], Wkbs[k-1], sks[k-1] = self.gap(k)
if which == 'f':
self.fs = fs
elif which == 'gap':
G = []
for i in range(len(ks)):
G.append((Wkbs-Wks)[i] - ((Wkbs-Wks)[i+1]-sks[i+1]))
self.G = np.array(G)
else:
self.fs = fs
G = []
for i in range(len(ks)):
G.append((Wkbs-Wks)[i] - ((Wkbs-Wks)[i+1]-sks[i+1]))
self.G = np.array(G)

def plot_all(self):
X = self.X
ks = range(1, len(self.fs)+1)
fig = plt.figure(figsize=(18,5))
# Plot 1
ax1.set_xlim(-1,1)
ax1.set_ylim(-1,1)
ax1.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5)
tit1 = 'N=%s' % (str(len(X)))
ax1.set_title(tit1, fontsize=16)
# Plot 2
ax2.set_ylim(0, 1.25)
ax2.plot(ks, self.fs, 'ro-', alpha=0.6)
ax2.set_xlabel('Number of clusters K', fontsize=16)
ax2.set_ylabel('f(K)', fontsize=16)
foundfK = np.where(self.fs == min(self.fs))[0][0] + 1
tit2 = 'f(K) finds %s clusters' % (foundfK)
ax2.set_title(tit2, fontsize=16)
# Plot 3
ax3.bar(ks, self.G, alpha=0.5, color='g', align='center')
ax3.set_xlabel('Number of clusters K', fontsize=16)
ax3.set_ylabel('Gap', fontsize=16)
foundG = np.where(self.G > 0)[0][0] + 1
tit3 = 'Gap statistic finds %s clusters' % (foundG)
ax3.set_title(tit3, fontsize=16)
ax3.xaxis.set_ticks(range(1,len(ks)+1))
plt.savefig('detK_N%s.png' % (str(len(X))), \
bbox_inches='tight', dpi=100)


For a first experiment comparing the Pham et al. and the gap statistic approaches, we create a data set comprising 300 points around 2 Gaussian-distributed clusters. We run both methods to select $K$ spanning the values $K=1, \ldots, 9$. (The function run from class DetK takes a value $k_{thr}$ as input and checks all values such that $k < k_{thr}$.) Note that every run of the k-means clustering algorithm for different values of $K$ is preceded by the k-means++ initialization algorithm, to prevent landing at suboptimal clustering solutions.

To run a full comparison of both methods, the following simple commands are invoked:

kpp = DetK(2, N=300)
kpp.run(10)
kpp.plot_all()


This produces the following result plots:

According to Pham et al. lower values of $f(K)$, and especially values $f(K) < 0.85$ are an indication of cluster-like features in the data at that particular $K$. In the case of $K=2$, the global minimum of $f(K)$ in the central plot leaves no doubt that this is the right value to choose for this particular data configuration. The gap statistic, depicted in the plot on the right, yields the same result of $K=2$. Remember that the optimal $K$ with the gap statistic is the smallest value for which the gap quantity becomes positive.

Similarly, we can analyze a data set consisting of 100 points around a single cluster. The results are shown in the plots below. We observe how the function $f(K)$ does not show any prominent valley or value for which $f(K) < 0.85$ for any of the surveyed $K$s. According to the Pham et al. paper, this is an indication of no clustering, as is the case. The gap statistic agrees that there is no more than one cluster in this case.

Finally, let us look at two cases, both with 500 data points around 4 clusters. Below are the plots of the results:

For the data distribution on the top, one can see that the 4 clusters are positioned in such a way that they could also be interpreted as 2 clusters made of 2 subclusters each. The $f(K)$ detects this configuration and suggests 2 possible values of $K$, with a slight preference for $K=2$ over $K=4$. The gap statistic changes sign at $K=2$, albeit barely, and it does it again and more clearly at $K=4$. In both cases, a strict application of the rules prescribed to select the correct $K$ does lead to a rather suboptimal, or at least dubious, choice.

In the bottom plot however, the 4 clusters are somehow more evenly spreaded and both algorithms succeed at identifying $K=4$. The $f(K)$ method still shows a relative minimum at $K=2$, indicating a potentially alternative clustering.

### Performance comparison of f(K) and the gap statistic

If both methods to select the optimal $K$ for k-means clustering yield similar results, one should ask about the relative performance of them in real-life data science clustering problems. It is straightforward to predict that the gap statistic, with its need for running the k-means algorithm multiple times to create a Monte Carlo reference distribution, will necessarily be a poorer performer. We can easily test this hypothesis with our code by running both approaches and timing them using the IPython magic %time function. For a data set with $N = 500$:

%time kpp.run(10, which='f')


CPU times: user 2.72 s, sys: 0.00 s, total: 2.72 s Wall time: 2.90 s

%time kpp.run(10, which='gap')


CPU times: user 51.30 s, sys: 0.01 s, total: 51.31 s Wall time: 51.40 s

In this particular example, the $f(K)$ method is more than one order of magnitude more performant than the gap statistic, and this comparison looks worse for the latter the more data we take into consideration and the larger the number $B$ employed for generating the reference distributions.

### Table-top data experiment take-away message

The estimation of the optimal number of clusters within a set of data points is a very important problem, as most clustering algorithms need that parameter as input in order to group the data. Many methods have been proposed to find the proper $K$, among which the approach proposed by Pham et al. in 2004 seems to offer a very straightforward and performant solution. The estimation of the function $f(K)$ over the desired range of test values for $K$ offers an immediate way of assessing when the cluster-like features appear and allows to choose among a best value and other alternatives. A comparison in performance with the gap statistic method of Tibshirani et al. concludes that the $f(K)$ is computationally advantageous.

# Clustering With K-Means in Python

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar among them than they are to the others. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their more frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

### Mathematical background

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of X that assigns each of the points in X to a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.

The mathematical condition for the K clusters $C_k$ and the K centroids $\mu_k$ can be expressed as:

Minimize $\displaystyle \sum_{k=1}^K \sum_{\mathrm{x}_n \in C_k} ||\mathrm{x}_n - \mu_k ||^2$ with respect to $\displaystyle C_k, \mu_k$.

### Lloyd’s algorithm

Finding the solution is unfortunately NP hard. Nevertheless, an iterative method known as Lloyd’s algorithm exists that converges (albeit to a local minimum) in few steps. The procedure alternates between two operations. (1) Once a set of centroids $\mu_k$ is available, the clusters are updated to contain the points closest in distance to each centroid. (2) Given a set of clusters, the centroids are recalculated as the means of all points belonging to a cluster.

$\displaystyle C_k = \{\mathrm{x}_n : ||\mathrm{x}_n - \mu_k|| \leq \mathrm{\,\,all\,\,} ||\mathrm{x}_n - \mu_l||\}\qquad(1)$

$\displaystyle \mu_k = \frac{1}{C_k}\sum_{\mathrm{x}_n \in C_k}\mathrm{x}_n\qquad(2)$

The two-step procedure continues until the assignments of clusters and centroids no longer change. As already mentioned, the convergence is guaranteed but the solution might be a local minimum. In practice, the algorithm is run multiple times and averaged. For the starting set of centroids, several methods can be employed, for instance random assignation.

Below is a simple implementation of Lloyd’s algorithm for performing k-means clustering in python:

import numpy as np

def cluster_points(X, mu):
clusters  = {}
for x in X:
bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \
for i in enumerate(mu)], key=lambda t:t[1])[0]
try:
clusters[bestmukey].append(x)
except KeyError:
clusters[bestmukey] = [x]
return clusters

def reevaluate_centers(mu, clusters):
newmu = []
keys = sorted(clusters.keys())
for k in keys:
newmu.append(np.mean(clusters[k], axis = 0))
return newmu

def has_converged(mu, oldmu):
return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

def find_centers(X, K):
# Initialize to K random centers
oldmu = random.sample(X, K)
mu = random.sample(X, K)
while not has_converged(mu, oldmu):
oldmu = mu
# Assign all points in X to clusters
clusters = cluster_points(X, mu)
# Reevaluate centers
mu = reevaluate_centers(oldmu, clusters)
return(mu, clusters)


### Clustering in action

Let’s see the algorithm in action! For an ensemble of 100 random points on the plane, we set the k-means function to find 7 clusters. The code converges in 7 iterations after initializing with random centers. In the following plots, dots correspond to the target data points X and stars represent the centroids $\mu_k$ of the clusters. Each cluster is distinguished by a different color.

The initial configuration of points for the algorithm is created as follows:

import random

def init_board(N):
X = np.array([(random.uniform(-1, 1), random.uniform(-1, 1)) for i in range(N)])
return X


For a configuration with twice as many points and a target of 3 clusters, often the algorithm needs more iterations to converge.

Obviously, an ensemble of randomly generated points does not possess a natural cluster-like structure. To make things slightly more tricky, we want to modify the function that generates our initial data points to output a more interesting structure. The following routine constructs a specified number of Gaussian distributed clusters with random variances:

def init_board_gauss(N, k):
n = float(N)/k
X = []
for i in range(k):
c = (random.uniform(-1, 1), random.uniform(-1, 1))
s = random.uniform(0.05,0.5)
x = []
while len(x) < n:
a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
# Continue drawing points from the distribution in the range [-1,1]
if abs(a) < 1 and abs(b) < 1:
x.append([a,b])
X.extend(x)
X = np.array(X)[:N]
return X


Let us look at a data set constructed as X = init_board_gauss(200,3): 7 iterations are needed to find the 3 centroids.

If the target distribution is disjointedly clustered and only one instantiation of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution. This is shown in the example below, where initial data using very peaked Gaussians is constructed:

The yellow and black stars serve two different clusters each, while the orange, red and blue centroids are cramped within one unique blob due to an unfortunate random initialization. For this type of cases, a cleverer election of initial clusters should help.

To finalize our table-top experiment on k-means clustering, we might want to take a look at what happens when the original data space is densely populated:

The k-means algorithm induces a partition of the observations corresponding to a Voronoi tessellation generated by the K centroids. And it is certainly very pretty!

### Table-top data experiment take-away message

Lloyd’s two-step implementation of the k-means algorithm allows to cluster data points into groups represented by a centroid. This technique is employed in many facets of machine learning, from unsupervised learning algorithms to dimensionality reduction problems. The general clustering problem is NP hard, but the iterative procedure converges always, albeit to a local minimum. Proper initialization of the centroids is important. Additionally, this algorithm does not supply information as to which K for the k-means is optimal; that has to be found out by alternative methods.

Update: We explore the gap statistic as a method to determine the optimal K for clustering in this post: Finding the K in K-Means Clustering and the $f(K)$ method: Selection of K in K-means Clustering, Reloaded.

Update: For a proper initialization of the centroids at the start of the k-means algorithm, we implement the improved k-means++ seeding procedure.