Clustering With K-Means in Python

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar among them than they are to the others. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their more frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

Mathematical background

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of X that assigns each of the points in X to a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.

The mathematical condition for the K clusters $C_k$ and the K centroids $\mu_k$ can be expressed as:

Minimize $\displaystyle \sum_{k=1}^K \sum_{\mathrm{x}_n \in C_k} ||\mathrm{x}_n - \mu_k ||^2$ with respect to $\displaystyle C_k, \mu_k$.

Lloyd’s algorithm

Finding the solution is unfortunately NP hard. Nevertheless, an iterative method known as Lloyd’s algorithm exists that converges (albeit to a local minimum) in few steps. The procedure alternates between two operations. (1) Once a set of centroids $\mu_k$ is available, the clusters are updated to contain the points closest in distance to each centroid. (2) Given a set of clusters, the centroids are recalculated as the means of all points belonging to a cluster. $\displaystyle C_k = \{\mathrm{x}_n : ||\mathrm{x}_n - \mu_k|| \leq \mathrm{\,\,all\,\,} ||\mathrm{x}_n - \mu_l||\}\qquad(1)$ $\displaystyle \mu_k = \frac{1}{C_k}\sum_{\mathrm{x}_n \in C_k}\mathrm{x}_n\qquad(2)$

The two-step procedure continues until the assignments of clusters and centroids no longer change. As already mentioned, the convergence is guaranteed but the solution might be a local minimum. In practice, the algorithm is run multiple times and averaged. For the starting set of centroids, several methods can be employed, for instance random assignation.

Below is a simple implementation of Lloyd’s algorithm for performing k-means clustering in python:

import numpy as np

def cluster_points(X, mu):
clusters  = {}
for x in X:
bestmukey = min([(i, np.linalg.norm(x-mu[i])) \
for i in enumerate(mu)], key=lambda t:t)
try:
clusters[bestmukey].append(x)
except KeyError:
clusters[bestmukey] = [x]
return clusters

def reevaluate_centers(mu, clusters):
newmu = []
keys = sorted(clusters.keys())
for k in keys:
newmu.append(np.mean(clusters[k], axis = 0))
return newmu

def has_converged(mu, oldmu):
return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

def find_centers(X, K):
# Initialize to K random centers
oldmu = random.sample(X, K)
mu = random.sample(X, K)
while not has_converged(mu, oldmu):
oldmu = mu
# Assign all points in X to clusters
clusters = cluster_points(X, mu)
# Reevaluate centers
mu = reevaluate_centers(oldmu, clusters)
return(mu, clusters)

Clustering in action

Let’s see the algorithm in action! For an ensemble of 100 random points on the plane, we set the k-means function to find 7 clusters. The code converges in 7 iterations after initializing with random centers. In the following plots, dots correspond to the target data points X and stars represent the centroids $\mu_k$ of the clusters. Each cluster is distinguished by a different color. The initial configuration of points for the algorithm is created as follows:

import random

def init_board(N):
X = np.array([(random.uniform(-1, 1), random.uniform(-1, 1)) for i in range(N)])
return X

For a configuration with twice as many points and a target of 3 clusters, often the algorithm needs more iterations to converge. Obviously, an ensemble of randomly generated points does not possess a natural cluster-like structure. To make things slightly more tricky, we want to modify the function that generates our initial data points to output a more interesting structure. The following routine constructs a specified number of Gaussian distributed clusters with random variances:

def init_board_gauss(N, k):
n = float(N)/k
X = []
for i in range(k):
c = (random.uniform(-1, 1), random.uniform(-1, 1))
s = random.uniform(0.05,0.5)
x = []
while len(x) < n:
a, b = np.array([np.random.normal(c, s), np.random.normal(c, s)])
# Continue drawing points from the distribution in the range [-1,1]
if abs(a) < 1 and abs(b) < 1:
x.append([a,b])
X.extend(x)
X = np.array(X)[:N]
return X

Let us look at a data set constructed as X = init_board_gauss(200,3): 7 iterations are needed to find the 3 centroids. If the target distribution is disjointedly clustered and only one instantiation of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution. This is shown in the example below, where initial data using very peaked Gaussians is constructed: The yellow and black stars serve two different clusters each, while the orange, red and blue centroids are cramped within one unique blob due to an unfortunate random initialization. For this type of cases, a cleverer election of initial clusters should help.

To finalize our table-top experiment on k-means clustering, we might want to take a look at what happens when the original data space is densely populated: The k-means algorithm induces a partition of the observations corresponding to a Voronoi tessellation generated by the K centroids. And it is certainly very pretty!

Table-top data experiment take-away message

Lloyd’s two-step implementation of the k-means algorithm allows to cluster data points into groups represented by a centroid. This technique is employed in many facets of machine learning, from unsupervised learning algorithms to dimensionality reduction problems. The general clustering problem is NP hard, but the iterative procedure converges always, albeit to a local minimum. Proper initialization of the centroids is important. Additionally, this algorithm does not supply information as to which K for the k-means is optimal; that has to be found out by alternative methods.

Update: We explore the gap statistic as a method to determine the optimal K for clustering in this post: Finding the K in K-Means Clustering and the $f(K)$ method: Selection of K in K-means Clustering, Reloaded.

Update: For a proper initialization of the centroids at the start of the k-means algorithm, we implement the improved k-means++ seeding procedure.

1. Pete Colligan (@linuxster)

hi. would you offer python notebook version of this?

2. Pete Colligan (@linuxster)

Hello,
You can export your ipython notebook markup and upload it to github gist. Then reference the gist URL from the http://nbviewer.ipython.org/ you mentioned above.

you can also just email me (linuxster at gmail dot com) source tarball or post your source on github and I can clone.

**I have put an example of ILP tutorial notebooks here: http://nbviewer.ipython.org/gist/linuxster/7295256

currently I am focusing on the data characteristics that feed the algorithm, especially in the case where I might automate as much as possible and therefore must take into account pre-processing (missing data, outliers, encoding, feature selection, etc)

maybe you have experience here already?

1. I am interested in exploring the effects of L1 and L2 norm on the cluster output. in your example you use the L2 norm but since Kmeans is sensitive to outliers maybe L1 norm is better?
2. also I want to explore whether I can use the mode instead of mean for the centroid recalculation in step two of LLoyds algorithm. again, something to minimize impact of outliers.
3. also if I have mixed data (categorical and numerical) that I would cluster together what is the best approach for pre-processing the data? I have heard of some folks encoding the categorical data into a binary vector format to support the distance calculations but would like to get other opinions here as well.

just saw there is a follow up post to this one on k selection. will read. 🙂

• datasciencelab

Thanks! I will set up my github/gist soon and will keep you posted.

Your discussion on L1/L2 norm is very interesting. What I’ve read so far always uses L2 to measure the intracluster variance, but exploring L1 is surely worthwhile. Re: mode versus mean, I guess it depends on the distribution of your data. My play data just follows normal distribution or random, thus I don’t think that the mode brings much more than the mean. It’d be interesting to see what happens when data follows other shapes though. Regarding mixed data, I can’t offer much insight there. What kind of categorical data do you have in mind?

3. Pete Colligan (@linuxster)

sorry. forgot to reference the notebook I refer to I cloned from nipun batra.

4. Pete Colligan (@linuxster)

I have data from a customer which includes retail store characteristics. They would like to use many store attributes to help them find new store “groupings” that they an use for business planning activities. There are about 50 attributes per store and data is numerical, ordinal, and categorical in nature. (size, nearest competitor, demographic info, sales, market segment, etc.). I would like to first:
1. get an idea of which attributes account for the most variance in the data. PCA is ok but I get a bit lost mapping from one basis to the other. (ex. which PCA component contains which features from my dataset). Goal here is some dimensionality reduction since I know some of my features are collinear.
2. after certain “redundant” information is removed, I think I should look at the data distribution of each feature to get an idea of the “nature” of the data. not quite sure here any hard and fast rules. Ex. Normalization, Regularization?
3. then I would go for clustering algorithm. At the moment I like Kmeans and agglomerative hierarchical.

what do you think about that thought process?

note: Kmeans will force every point to a cluster. Outliers are not ALWAYS bad in this use case (some stores like “super stores” are important to keep and might be an own cluster) so I don’t want to always just remove them prior to running the algorithm.

• datasciencelab

I see that your problem is another kind of beast, as you probably need to start with a multivariate analysis to determine the relative importance of each attribute (your step 1, for which you could do some ANOVA) before you can proceed to clustering on them. Your strategy seems correct, and due to the complexity of the problem, I’m sure you’ll find many interesting steps to complete along the way. Good luck!

5. Mike Files

Reblogged this on Big Data . . . and commented:
Module 4: Advanced Analytics: K-Means Clustering

6. Pingback: Grouping countries according to flag similarity - Virostatiq
7. guido

do not use i or mu[]
instead i,m enumerate(mu) directly

• datasciencelab

Thanks for pointing out the superior solution with enumerate, Guido!

8. Pingback: @TKMafioso "Family Struggles" Official Video #NEWMUSIC #IEMUSICGANG | #iRapindIE
9. nico

Hi there,
big thanks for all of that, seems to be exactly what I was looking for. I am not that new to python, but new to clustering but i guess I will find my way to use the implementations on my data.
One thing I was wondering is what would be the best way to keep track of the sample id’s, which is of importance in my research? Can someone point me in a direction?
Thanks

10. Lars

Thank you for this excellent set of examples. K-Means and Lloyd’s Algorithm are probably what I’m looking for. I’m trying to investigate a cluster problem that has a twist. Specifically, I would like to find good locations for public schools, based on the geographical distribution of students. I can estimate the X-Y coordinates of the students on a city map, and could do a simple cluster analysis. However the problem is complicated by the fact that the city has many internal barriers (waterways, railway tracks, fenced expressways, etc) which divide the city into many smaller areas, and which can only be crossed at a limited number of points. This means that two points whose x-y coordinates might be close, could in fact be separated by a barrier that forces a much greater travel distance.
The distance between a point in area A, and another point in Area B, is in fact the sum of their distances to the bridge point across the barrier. (rather than simply the distance calculated from their x-y coordinates.) That is fairly straight forward. My thinking is that I should try to intercept the cluster processing at some midpoint and substitute the practical distance for the distance calculated from the x-y coordinates, and then proceed with the analysis.
I have some experience with Python, but I have no formal training in Cluster Analysis.
My questions are:
1. Is it possible to intercept the processing and substitute different distance values.
2. Does this type of problem have a standard solution, and I simply haven’t found it.
3. Do you recommend some other approach.
Thanks,

11. vineeth

Thanks for the example. How can i implement k-means algorithm for finding severe and normal facial palsy using python.

12. Pingback: Machine Learning and K-Means Clustering on Zillabyte - Zillablog
13. joon

Thanks a lot for the post. It seems closing ) is missing from the line: return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]) in def has_converged(mu, oldmu).

• Greg S

Hi, where does the oldmu get created in the algorithm?

14. muni

Thanks for sharing. Very informative. but, could you please update me with any sample data

15. Pingback: Distilled News | Data Analytics & R
16. Pingback: Clustering With K-Means in Python | d@n3n | Sc...
17. fnokeke

Reblogged this on onetorch and commented:
Straightforward tutorial on K-Means Clustering.

18. Anonymous

Hello, how to visualize the iterative clustering steps with python? Thank you.

19. Pingback: Кластеризация с помощью метода k-средних на Python |
20. kevin

Do you know why I am getting the error “global name ‘zeros’ is not defined”?

Thanks

Kevin

21. Jiajun

Hi, Thanks for this amazing example.
But I really struggle to understand:
bestmukey = min([(i, np.linalg.norm(x – mu[i])) \
for i in enumerate(mu)], key = lambda t: t)
I know what it is trying to achieve: outputting the key (index) of mu, which has the shortest distance to x. I think my struggle is to understand the use of min() and enumerate(). Is it possible you can rewrite it in a normal loop? Appreciate it!!

J

22. richarddunks

Reblogged this on Datapolitan and commented:
A great overvew of k-Means clustering in Python with good visual explanations.

23. phil

yes, can you explain that bestmukey line for non python users
the goal is to explain KMeans, so why do you use such a non standard syntax, come on

24. Master

Does anyone know how to label the dots in the cluster ? I want to label some dots in the cluster, but I don’t know how.

• nico

@Master, I did asked sort of that question a while ago (July 24, 2014), without an answer.
After the clustering is done, my workaround consist of comparing each element in the ‘cluster’ dictionary to the input list where the data-to-id connection is known. The id gets transferred to a similar structured dictionary like ‘clusters’, which in the end is holding all the id’s. Might be far from elegant, but makes annotation via matplotlib easy…

25. mannyglover

I really appreciate this post, but can you post code to show how to visualize? I know it can’t be uber-complicated, but it would be nice to see the code that you used to visualize the data. Thanks.

26. P A

I appreciate some help. When I tried running the code, I get the error below. How should i be handling X to address this error message? I am afraid that I am new to both python and numpy. Very familiar with matlab.
Thanks!

ERROR:
line 33, in find_centers
oldmu = random.sample(X, K)
File “/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/random.py”, line 311, in sample
raise TypeError(“Population must be a sequence or set. For dicts, use list(d).”)
TypeError: Population must be a sequence or set. For dicts, use list(d).

• datasciencelab

The problem is compatibility between python2 (the code was written for python2) and your running python3.
To get it running in python3 please substitute
oldmu = random.sample(X, K)
by
random.sample(list(X), 3)

27. kalij

I need some help. When I tried running the code, I get the error below.
Can you explain, what the reason for it can be and how can i solve this problem?
In this part of the code:
def has_converged(mu, oldmu):
return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

File “”, line 2
return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

SyntaxError: invalid syntax

Thanks!

• datasciencelab

There’s a missing closing parenthesis in the code. The correct version is:
def has_converged(mu, oldmu):
return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))

28. Pingback: Help Links – 23 March 2016 | Vijay Data Science
29. Kate32

Good example!
But how can I actually make a visualization?
You have attached these pictures, so can you show me the code, please?

thanks

30. Greg S

HI, how do I define mu at the start of the code? What should this be initializes to? Thanks

31. The Data Coder

Reblogged this on thedatacoder and commented:
Do check out this example of K-means implementation in Python!

32. boop

I’m also hoping to see the code for the visualization, that would be great… thanks!

33. sb

I ran the code for k=3 and the output was only two clusters. Is this normal in k-means?

• datasciencelab

No, it’s not. The algorithm is designed to produce k centroids with their k associated clusters. I’m not sure what could be happening there.

• sb

I think it is because of the very small data set that I was using. Thanks!

34. datasciencelab

Oh, right, clustering N points to K clusters works only if N >= K. 🙂

35. Alex Luya

Hello,@datasciencelab,you said “Sure! I just need to set up a convenient way to share”,but I can’t find it,did it miss something?

36. sb

Why are you passing ‘mu’ to the ‘reevaluate_centers’ function? It is not being used to calculate the ‘newmu’.

37. Nilesh

How did you plotted this?
I am using only one feature and facing issues during plotting.

38. ashleymaeconard

Hello, Could you provide the way to plot all of this? I can regenerate the results from the code but cannot plot. Thank you so much!

• Ole

As a beginner in python this is certainly not the best way to plot this, but at least I found a way…:
Starting from
X = init_board_gauss(1000, 4)
# number of centers n
n = 4
Y = find_centers(X, n)
# get the x and y for the centers and put them into a pandas dataframe
xcenter, ycenter = zip(*Y)
center = np.column_stack((xcenter, ycenter))
df_center = pd.DataFrame(data=center, columns=(‘x’, ‘y’))
# do the same for the clusters (dictionary with dataframes)
df_cluster = dict()
for i in range(0, (len(Y))):
print (i)
xtemp, ytemp = zip(*Y[i])
cluster = np.column_stack((xtemp, ytemp))
df_cluster[i] = pd.DataFrame(data=cluster, columns=(‘x’, ‘y’))

# and now plot it
for i in range(0, (len(df_cluster))):
plt.scatter(df_cluster[i].x, df_cluster[i].y)
plt.scatter(df_center.x, df_center.y, marker=’X’)
plt.show()

39. Pingback: Cluster unstructured text with Python – Avid Machine Learning
40. Pingback: Easily understand K-means clustering – Avid Machine Learning
41. fredo68

In the last example, honestly, I don’t see any cluster. The points look way to evenly distributed to show any concentration. We should not see clusters where there is none. Some techniques should be able to determine that.

42. windson

To people who wanna understand
for x in X:
bestmukey = min([(i, np.linalg.norm(x-mu[i])) \
for i in enumerate(mu)], key=lambda t:t)
X means all points you wanna cluster
for example:
[[1, 2], [2, 3], [3, 4]]

mu means the random points you choose from X
line27 mu = random.sample(X, K)

We break the code into serveral pieces
1. min(FOO, key=lambda t:t)
2. FOO = [(i, BAR) for i in enumerate(mu)]
3. BAR = np.linalg.norm(x-mu[i])

line3 is use for calculate the distance between two points.
We need to find out which point is nearest, so every x have to calculate the distance between all of the mu, line2 is equal to
lst = []
for i in enumerate(mu):
lst.append((i, BAR))
# The distance between the point x and the first mu is d1, the second is d2,
# So we got all_di = [(0, d1), (1, d2), (2, d3)],

x belongs to the nearest point, so line1 compare all_di by its second element, finallywe get the the index of points which x belongs to, like this
# finish for x in X:
{
0: [1,2,3],
1: [4,5,6],
2: [7,8,9]
}
point 1,2,3 belongs to the first cluster…

43. Lanelle Aylock

Simply a smiling visitant here to share the love (:, btw great design. “He profits most who serves best.” by Arthur F. Sheldon.