# Improved Seeding For Clustering With K-Means++

Clustering data into subsets is an important task for many data science applications. At The Data Science Lab we have illustrated how Lloyd’s algorithm for k-means clustering works, including snapshots of python code to visualize the iterative clustering steps. One of the issues with the procedure is that

this algorithm does not supply information as to which K for the k-means is optimal; that has to be found out by alternative methods,

so that we went a step further and coded up the gap statistic to find the proper k for k-means clustering. In combination with the clustering algorithm, the gap statistic allows to estimate the best value for k among those in a given range.

An additional problem with the standard k-means procedure still remains though, as shown by the image on the right, where a poor random initialization of the centroids leads to suboptimal clustering:

If the target distribution is disjointedly clustered and only one instantiation of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution.

The initialization problem for the k-means algorithm is an important practical one, and has been discussed extensively. It is desirable to augment the standard k-means clustering procedure with a robust initialization mechanism that guarantees convergence to the optimal solution.

### k-means++: the advantages of careful seeding

A solution called k-means++ was proposed in 2007 by Arthur and Vassilvitskii. This algorithm comes with a theoretical guarantee to find a solution that is O(log k) competitive to the optimal k-means solution. It is also fairly simple to describe and implement. Starting with a dataset *X* of *N* points ,

- choose an initial center uniformly at random from
*X*. Compute the vector containing the square distances between all points in the dataset and : - choose a second center from
*X*randomly drawn from the probability distribution - recompute the distance vector as
- choose a successive center and recompute the distance vector as
- when exactly
*k*centers have been chosen, finalize the initialization phase and proceed with the standard k-means algorithm

The interested reader can find a review of the k-means++ algorithm at normaldeviate, a survey of implementations in several languages at rosettacode and a ready-to-use solution in pandas by Jack Maney in github.

### A python implementation of the k-means++ algorithm

Out python implementation of the k-means++ algorithm builds on the code for standard k-means shown in the previous post. The `KMeans`

class defined below contains all necessary functions and methods to generate toy data and run the Lloyd’s clustering algorithm on it:

class KMeans(): def __init__(self, K, X=None, N=0): self.K = K if X == None: if N == 0: raise Exception("If no data is provided, \ a parameter N (number of points) is needed") else: self.N = N self.X = self._init_board_gauss(N, K) else: self.X = X self.N = len(X) self.mu = None self.clusters = None self.method = None def _init_board_gauss(self, N, k): n = float(N)/k X = [] for i in range(k): c = (random.uniform(-1,1), random.uniform(-1,1)) s = random.uniform(0.05,0.15) x = [] while len(x) < n: a,b = np.array([np.random.normal(c[0],s),np.random.normal(c[1],s)]) # Continue drawing points from the distribution in the range [-1,1] if abs(a) and abs(b)<1: x.append([a,b]) X.extend(x) X = np.array(X)[:N] return X def plot_board(self): X = self.X fig = plt.figure(figsize=(5,5)) plt.xlim(-1,1) plt.ylim(-1,1) if self.mu and self.clusters: mu = self.mu clus = self.clusters K = self.K for m, clu in clus.items(): cs = cm.spectral(1.*m/self.K) plt.plot(mu[m][0], mu[m][1], 'o', marker='*', \ markersize=12, color=cs) plt.plot(zip(*clus[m])[0], zip(*clus[m])[1], '.', \ markersize=8, color=cs, alpha=0.5) else: plt.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5) if self.method == '++': tit = 'K-means++' else: tit = 'K-means with random initialization' pars = 'N=%s, K=%s' % (str(self.N), str(self.K)) plt.title('\n'.join([pars, tit]), fontsize=16) plt.savefig('kpp_N%s_K%s.png' % (str(self.N), str(self.K)), \ bbox_inches='tight', dpi=200) def _cluster_points(self): mu = self.mu clusters = {} for x in self.X: bestmukey = min([(i[0], np.linalg.norm(x-mu[i[0]])) \ for i in enumerate(mu)], key=lambda t:t[1])[0] try: clusters[bestmukey].append(x) except KeyError: clusters[bestmukey] = [x] self.clusters = clusters def _reevaluate_centers(self): clusters = self.clusters newmu = [] keys = sorted(self.clusters.keys()) for k in keys: newmu.append(np.mean(clusters[k], axis = 0)) self.mu = newmu def _has_converged(self): K = len(self.oldmu) return(set([tuple(a) for a in self.mu]) == \ set([tuple(a) for a in self.oldmu])\ and len(set([tuple(a) for a in self.mu])) == K) def find_centers(self, method='random'): self.method = method X = self.X K = self.K self.oldmu = random.sample(X, K) if method != '++': # Initialize to K random centers self.mu = random.sample(X, K) while not self._has_converged(): self.oldmu = self.mu # Assign all points in X to clusters self._cluster_points() # Reevaluate centers self._reevaluate_centers()

To initalize the board with n data points normally distributed around k centers, we call `kmeans = KMeans(k, N=n)`

.

kmeans = KMeans(3, N=200) kmeans.find_centers() kmeans.plot_board()

The snippet above creates a board with 200 points around 3 clusters. The call to the `find_centers()`

function runs the standard k-means algorithm initializing the centroids to 3 random points. Finally, the function `plot_board()`

produces a plot of the data points as clustered by the algorithm, with the centroids marked as stars. In the image below we can see the results of running the algorithm twice. Due to the random initialization of the standard k-means, the correct solution is found some of the times (right panel) whereas in some cases a suboptimal end point is reached instead (left panel).

Let us now implement the k-means++ algorithm in its own class, which inherits from the `class Kmeans`

defined above.

class KPlusPlus(KMeans): def _dist_from_centers(self): cent = self.mu X = self.X D2 = np.array([min([np.linalg.norm(x-c)**2 for c in cent]) for x in X]) self.D2 = D2 def _choose_next_center(self): self.probs = self.D2/self.D2.sum() self.cumprobs = self.probs.cumsum() r = random.random() ind = np.where(self.cumprobs >= r)[0][0] return(self.X[ind]) def init_centers(self): self.mu = random.sample(self.X, 1) while len(self.mu) < self.K: self._dist_from_centers() self.mu.append(self._choose_next_center()) def plot_init_centers(self): X = self.X fig = plt.figure(figsize=(5,5)) plt.xlim(-1,1) plt.ylim(-1,1) plt.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5) plt.plot(zip(*self.mu)[0], zip(*self.mu)[1], 'ro') plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N),str(self.K)), \ bbox_inches='tight', dpi=200)

To run the k-means++ initialization stage using this class and visualize the centers found by the algorithm, we simply do:

kplusplus = KPlusPlus(5, N=200) kplusplus.init_centers() kplusplus.plot_init_centers()

Let us explore what the function `init_centers()`

is actually doing: to begin with, a random point is chosen as first center from the *X* data points as `random.sample(self.X, 1)`

. Then, the successive centers are picked, stopping when we have *K*=5 of them. The procedure to choose the next most suitable center is coded up in the `_choose_next_center()`

function. As we described above, the next center is drawn from a distribution given by the normalized distance vector . To implement such a probability distribution, we compute the cumulative probabilities for choosing each of the *N* points in *X*. These cumulative probabilities are partitions in the interval [0,1] with length equal to the probability of the corresponding point being chosen as a center, as explained in this stackoverflow thread. Therefore, by picking a random value and finding the point corresponding to the segment of the partition where that value falls, we are effectively choosing a point drawn according to the desired probability distribution. On the right is a plot showing the results of the algorithm for 200 points and 5 clusters.

Finally let us compare the results of k-means with random initialization and k-means++ with proper seeding, using the following code snippets:

# Random initialization kplusplus.find_centers() kplusplus.plot_board() # k-means++ initialization kplusplus.find_centers(method='++') kplusplus.plot_board()

The standard algorithm with random initialization in a particular instantiation (left panel) fails at identifying the 5 optimal centroids for the clustering, whereas the k-means++ initialization (right panel) succeeds in doing so. By picking up a specific and not random set of centroids to initiate the clustering process, the k-means++ algorithm also reaches convergence faster, guaranteed by the theorems proved in the Arthur and Vassilvitskii article.

### Table-top data experiment take-away message

The k-means++ method for finding a proper seeding for the choice of initial centroids yields considerable improvement over the standard Lloyd’s implementation of the k-means algorithm. The initial selection in k-means++ takes extra time and involves choosing centers in a successive order and drawing them from a particular probability distribution that has to be recomputed at each step. However, by doing so, the k-means part of the algorithm converges very quickly after this seeding and thus the whole procedure actually runs in a shorter computation time. The combination of the k-means++ initialization stage with the standard Lloyd’s algorithm, together with additional various techniques to find out an optimal value for the ideal number of clusters, poses a robust way to solve the complete problem of clustering data points.

to classes work, add these imports:

import numpy as np

import random

import matplotlib.pyplot as plt

import matplotlib.cm as cm

Thanks for the comment, you’re totally right!

As this post follows up on the previous ones in the clustering series, I didn’t include the imports again in the code shown here. Thanks for noticing!

when I run the instruction!

kmeans.find_centers()

I got error:

Traceback (most recent call last):

File “”, line 1, in

kmeans.find_centers()

File “C:/Python33/KMeans.py”, line 90, in find_centers

self.oldmu = random.sample(X, K)

File “C:\Python33\lib\random.py”, line 298, in sample

raise TypeError(“Population must be a sequence or set. For dicts, use list(d).”)

TypeError: Population must be a sequence or set. For dicts, use list(d).

code does not work

`self.mu = random.sample(X, K)` doesn’t do what you intend. For the random initialization method, I think you meant something more like:

indices = np.random.choice(len(X), 2 * K)

self.oldmu = X[indices[:K]]

self.mu = X[indices[K:]]

This picks two sets of (guaranteed distinct) points as your initial values for the cluster centers.

I think your implementation of the K-means++ algorithm is inefficient for large values of k, because it involves O(k^2) calculations but need only involve O(k). I found this also to be the case in the Matlab Statistical toolbox version of the code. The key is in the step where you compute the distance between each point in X and the nearest centre already chosen. The point is that you don’t need to recalculate the distances for each centre already chosen, because you have already worked that out on the previous step. So you only need an array of distances of the length of X and then compute the distance of each X to the SINGLE new centre you have chosen, and if that is less than the existing minimum for X then replace it in the array.

When I implemented this in Matlab (about 6 lines of code!) it made a dramatic speed up – around a factor 100 when I was using a large number of centres.

I am still trying to learn Python, but it looks to me that you are recomputing the distance of every x to every c each time. (If I’m wrong please forgive!)

It probably makes it easier to understand the algorithm from a tutorial point of view, but it doesn’t scale well with increasing k.

Thanks for this thoughtful comment! Indeed the code is not written for speed or optimised in any way. As you mention, it was written in the spirit of a tutorial but not intended for production. Your suggested fixes would absolutely speed up the performance and I’m glad you pointed them out.

Error message:

File “..kmeans_seed_0413.py”, line 153, in main

kplusplus.find_centers(method=’++’)

File “..kmeans_seed_0413.py”, line 102, in find_centers

while not self._has_converged():

File “..kmeans_seed_0413.py”, line 91, in _has_converged

return(set([tuple(a) for a in self.mu]) == set([tuple(a) for a in self.oldmu])\

TypeError: ‘NoneType’ object is not iterable

You should use ‘probability’ and not ‘cumulative probability’ as the distribution, or else cluster quality is unchanged from random seeding.

What does the “[0][0]” do in “ind = np.where(self.cumprobs >= r)[0][0]”? I do not use numpy, and I am trying to code this algorithm in pure Python.