
Clustering With K-Means in Python

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar to each other than they are to elements of other groups. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their most frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

Mathematical background

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling of X that assigns each of the points in X to a unique cluster. All points within a cluster are closer in distance to their centroid than they are to any other centroid.

The mathematical condition for the K clusters C_k and the K centroids \mu_k can be expressed as:

Minimize \displaystyle \sum_{k=1}^K \sum_{\mathrm{x}_n \in C_k} ||\mathrm{x}_n - \mu_k ||^2 with respect to \displaystyle C_k, \mu_k.

Lloyd’s algorithm

Finding the solution is unfortunately NP hard. Nevertheless, an iterative method known as Lloyd’s algorithm exists that converges (albeit to a local minimum) in a few steps. The procedure alternates between two operations. (1) Once a set of centroids \mu_k is available, the clusters are updated to contain the points closest in distance to each centroid. (2) Given a set of clusters, the centroids are recalculated as the means of all points belonging to a cluster.

\displaystyle C_k = \{\mathrm{x}_n : ||\mathrm{x}_n - \mu_k|| \leq ||\mathrm{x}_n - \mu_l|| \mathrm{\,\,for\,\,all\,\,} l\}\qquad(1)

\displaystyle \mu_k = \frac{1}{|C_k|}\sum_{\mathrm{x}_n \in C_k}\mathrm{x}_n\qquad(2)

The two-step procedure continues until the assignments of clusters and centroids no longer change. As already mentioned, convergence is guaranteed, but the solution might only be a local minimum. In practice, the algorithm is therefore run multiple times with different starting conditions, and the best solution (the one with the lowest total within-cluster distance to the centroids) is kept. For the starting set of centroids, several methods can be employed, for instance random assignment.

Below is a simple implementation of Lloyd’s algorithm for performing k-means clustering in Python:

import numpy as np
import random

def cluster_points(X, mu):
    # Assign every point to the cluster of its nearest centroid
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

def reevaluate_centers(mu, clusters):
    # Recompute each centroid as the mean of the points assigned to its cluster
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis=0))
    return newmu

def has_converged(mu, oldmu):
    # Stop when the set of centroids no longer changes between iterations
    return set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

def find_centers(X, K):
    # Initialize to K random centers drawn from the data points
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return (mu, clusters)
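Since a single run of Lloyd’s algorithm may land in a poor local minimum, a common way to use find_centers in practice is to restart it several times and keep the solution with the smallest within-cluster sum of squared distances. A minimal sketch of such a wrapper (this helper is not part of the original code, and the number of restarts is an arbitrary choice):

def find_centers_restarts(X, K, n_restarts=10):
    # Run Lloyd's algorithm several times and keep the best local minimum
    best_mu, best_clusters, best_cost = None, None, np.inf
    for _ in range(n_restarts):
        mu, clusters = find_centers(X, K)
        # Total within-cluster sum of squared distances to the centroids
        cost = sum(np.linalg.norm(x - m) ** 2
                   for m, k in zip(mu, sorted(clusters))
                   for x in clusters[k])
        if cost < best_cost:
            best_mu, best_clusters, best_cost = mu, clusters, cost
    return best_mu, best_clusters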

Clustering in action

Let’s see the algorithm in action! For an ensemble of 100 random points on the plane, we set the k-means function to find 7 clusters. The code converges in 7 iterations after initializing with random centers. In the following plots, dots correspond to the target data points X and stars represent the centroids \mu_k of the clusters. Each cluster is distinguished by a different color.

[Figure: k-means clustering of N = 100 random points into K = 7 clusters]

The initial configuration of points for the algorithm is created as follows:

import random

def init_board(N):
    X = np.array([(random.uniform(-1, 1), random.uniform(-1, 1)) for i in range(N)])
    return X
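Putting the pieces together, a configuration like the one in the plot above can be reproduced along these lines (a sketch of my own; the plotting details were not part of the original post):

import matplotlib.pyplot as plt

X = init_board(100)
mu, clusters = find_centers(X, 7)
for k in sorted(clusters):
    points = np.array(clusters[k])
    plt.plot(points[:, 0], points[:, 1], 'o', alpha=0.5)       # cluster points, one color each
plt.plot([m[0] for m in mu], [m[1] for m in mu], 'k*', ms=15)  # centroids as stars
plt.show()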

For a configuration with twice as many points and a target of 3 clusters, the algorithm often needs more iterations to converge.

[Figure: k-means clustering of N = 200 random points into K = 3 clusters]

Obviously, an ensemble of randomly generated points does not possess a natural cluster-like structure. To make things slightly trickier, we modify the function that generates our initial data points so that it outputs a more interesting structure. The following routine constructs a specified number of Gaussian-distributed clusters with random spreads:

def init_board_gauss(N, k):
    # Build k Gaussian blobs with random centers and spreads, N points in total
    n = float(N) / k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.random.normal(c[0], s), np.random.normal(c[1], s)
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X

Let us look at a data set constructed as X = init_board_gauss(200,3): 7 iterations are needed to find the 3 centroids.

[Figure: k-means on N = 200 points drawn from 3 Gaussian clusters, K = 3]

If the target distribution is disjointly clustered and only one instantiation of Lloyd’s algorithm is used, the danger exists that the local minimum reached is not the optimal solution. This is shown in the example below, where the initial data is constructed from very peaked Gaussians:

[Figure: k-means with K = 9 on N = 100 points drawn from peaked Gaussians, stuck in a poor local minimum]

The yellow and black stars each serve two different clusters, while the orange, red and blue centroids are crammed within one single blob due to an unfortunate random initialization. For cases like this, a smarter choice of the initial centroids should help.

To finalize our table-top experiment on k-means clustering, we might want to take a look at what happens when the original data space is densely populated:

[Figure: k-means with K = 15 on a densely populated board of N = 2000 points]

The k-means algorithm induces a partition of the observations corresponding to a Voronoi tessellation generated by the K centroids. And it is certainly very pretty!
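To make the tessellation explicit, the Voronoi cells of the final centroids can be overlaid on the data with SciPy (an illustration of the claim above; this snippet is not from the original post):

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

X = init_board(2000)
mu, clusters = find_centers(X, 15)
# The Voronoi diagram generated by the centroids coincides with the k-means partition
vor = Voronoi(np.array(mu))
ax = voronoi_plot_2d(vor, show_vertices=False).gca()
ax.plot(X[:, 0], X[:, 1], '.', alpha=0.3)
plt.show()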

Table-top data experiment take-away message

Lloyd’s two-step implementation of the k-means algorithm makes it possible to cluster data points into groups represented by a centroid. This technique is employed in many facets of machine learning, from unsupervised learning algorithms to dimensionality reduction problems. The general clustering problem is NP hard, but the iterative procedure always converges, albeit possibly to a local minimum. Proper initialization of the centroids is important. Additionally, this algorithm does not supply information as to which K for the k-means is optimal; that has to be found out by alternative methods.

Update: We explore the gap statistic as a method to determine the optimal K for clustering in this post: Finding the K in K-Means Clustering and the f(K) method: Selection of K in K-means Clustering, Reloaded.

Update: For a proper initialization of the centroids at the start of the k-means algorithm, we implement the improved k-means++ seeding procedure.
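The full implementation lives in that post; as a rough sketch of the idea (my own minimal version, not the code from the linked article), k-means++ picks each new centroid among the data points with probability proportional to its squared distance from the nearest centroid already chosen:

def init_plusplus(X, K):
    # k-means++ seeding: the first centroid is drawn uniformly at random
    centers = [X[np.random.randint(len(X))]]
    while len(centers) < K:
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.array([min(np.linalg.norm(x - c) ** 2 for c in centers) for x in X])
        # Sample the next centroid proportionally to those squared distances
        probs = d2 / d2.sum()
        centers.append(X[np.random.choice(len(X), p=probs)])
    return centers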

Put Some Pandas in Your Python


Every scientist needs tools to perform their experiments, and data scientists are no exception. Much advice has been written to answer the question of what the best stack of tools for machine learning, data analysis and big data is. At least for the past few years, the champions in the programming language league seem to be R, Python and SQL. SQL needs no introduction, being the standard tool for managing records held in relational databases. R is undeniably the language of choice for most data analysts and statisticians; the size of its community of users alone, and the thousands of contributed libraries and add-on packages, turn it into a reliable workhorse for a broad range of tasks. But one thing R lacks is a consistent syntax. To use somebody else’s words, the R language is pathologically flexible, and that might be very confusing to new users.

Python as a language for data science

Python, on the contrary, might not be king when it comes to statistics-specific problems; it is rather a general-purpose programming language. Luis Pedro Coelho summarizes it well:

Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage, perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.

Still and all, Python is growing to include an increasingly mature set of packages for data manipulation and analysis. One that has gotten a respectable amount of attention lately is pandas, a library that offers data structures and operations for manipulating numerical tables and time series. Its creator, Wes McKinney, regularly posts materials and tutorials in his blog, and has written a very handy book on data analysis with python. In combination with the ipython notebook, pandas provides an easy environment to develop and present data analysis routines.

Data analysis with pandas: a beginner's tutorial

Let us have a brief look into the basics of pandas with a beginner's tutorial. Installation of both ipython and pandas is trivial via pip or anaconda, and superior to using the standard Linux apt-get install utility, since it guarantees more recent versions. The latest releases of pandas and ipython are 0.12.0 and 1.1.0 respectively. Once installed, the notebook starts in the browser when invoking the following command from the terminal:

$ ipython notebook --pylab inline

Pandas has very neat routines to load datasets in various formats, which can be browsed by typing pd.read*? in ipython. For this table-top experiment we will read a CSV file containing attributes of photo cameras, found on the Aviz project page. The data specifies model, weight, focal length and price, among others, and can be downloaded from the Aviz wiki.

We load the file indicating that the fields are separated by semicolons, and rename the columns with descriptive labels for the variables. This creates a DataFrame, a fundamental structure that stores entries in tabular form, with an index (dataframe.index) for the rows and multiple columns for the different variables (dataframe.columns). Each column can be retrieved by dictionary-like notation (dataframe['column_name']) or by attribute (dataframe.column_name). The method head(n) applied to a data frame displays the first n rows of data (default = 5). The method apply(f) applies a function to each column or row. For our camera dataset, we also want to access the maker, which corresponds to the first word of the model name. Thus, we create a new column 'Maker' by operating on the existing column 'Model'.

import pandas as pd
df = pd.read_csv('Camera.csv', sep=';')
columns = ['Model', 'Date', 'MaxRes', 'LowRes', 'EffPix', 'ZoomW', 'ZoomT',
           'NormalFR', 'MacroFR', 'Storage', 'Weight', 'Dimensions', 'Price']
df.columns = columns
df['Maker'] = df['Model'].apply(lambda s:s.split()[0])
df[['Maker','Model','Date','MaxRes','LowRes','Weight','Dimensions','Price']].head()
   Maker  Model                    Date  MaxRes  LowRes  Weight  Dimensions  Price
0  Agfa   Agfa ePhoto 1280         1997    1024     640     420          95    179
1  Agfa   Agfa ePhoto 1680         1998    1280     640     420         158    179
2  Agfa   Agfa ePhoto CL18         2000     640       0       0           0    179
3  Agfa   Agfa ePhoto CL30         1999    1152     640       0           0    269
4  Agfa   Agfa ePhoto CL30 Clik!   1999    1152     640     300         128   1299



A number of handy operations can be performed on a data frame with virtually no effort:

      • Sorting data: display 5 most recent models
        df.sort(['Date'], ascending = False).head()
      • Filtering columns by value: show only models made by Nikon
        df[df['Maker'] == 'Nikon']
      • Filtering columns by range of values: return cameras with prices above 350 and below 500
        df[(df['Price'] > 350) & (df['Price'] <= 500)]
      • Get statistical descriptions of the data set: find maxima, minima, averages, standard deviations, percentiles
        df[['MaxRes','LowRes','Storage','Weight','Dimensions','Price']].describe()
            MaxRes       LowRes      Storage       Weight   Dimensions        Price
count  1038.000000  1038.000000  1036.000000  1036.000000  1036.000000  1038.000000
mean   2474.672447  1773.936416    17.447876   319.265444   105.363417   457.384393
std     759.513608   830.897955    27.440655   260.410137    24.262761   760.452918
min       0.000000     0.000000     0.000000     0.000000     0.000000    14.000000
25%    2048.000000  1120.000000     8.000000   180.000000    92.000000   149.000000
50%    2560.000000  2048.000000    16.000000   226.000000   101.000000   199.000000
75%    3072.000000  2560.000000    20.000000   350.000000   115.000000   399.000000
max    5616.000000  4992.000000   450.000000  1860.000000   240.000000  7999.000000
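
These one-liners compose naturally. As a small illustration (my own combination of the operations listed above, not from the original post), the Nikon models in the 350 to 500 price bracket can be extracted and sorted by price:

# Chain the Maker and Price filters, then sort (pandas 0.12 sort() API, as used above)
nikon_mid = df[(df['Maker'] == 'Nikon') & (df['Price'] > 350) & (df['Price'] <= 500)]
nikon_mid.sort(['Price'])[['Maker', 'Model', 'Date', 'Price']].head()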


Plotting data frames with pandas

Pandas comes with handy wrappers around standard matplotlib routines that make it very easy to plot data frames. In the example below we create a histogram of all camera prices under 500 in our data set. Most camera models are priced in the bracket between 100 and 150 (units are not provided in the metadata of this set, but we might safely assume that we are talking about US dollars here).

# Explicit imports so the snippet also runs outside the --pylab inline notebook
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams.update({'font.size': 16})
df[(df['Price'] < 500)][['Price']].hist(figsize=(8,5), bins=10, alpha=0.5)
plt.title('Histogram of camera prices\n')
plt.savefig('PricesHist.png', bbox_inches='tight')

[Figure: histogram of camera prices below 500]

Grouping data with pandas

A powerful way of manipulating data in a frame is via processes that involve splitting it, applying functions to groups and combining the results. This is the “group by” approach that most SQL users will immediately recognize. Here is a clear exposition of how groupby works in pandas. In our table-top experiment, we group the data according to release year of the models, and study the evolution of the remaining variables with time. To the grouped structure gDate we apply the aggregating function mean(), which averages the non-grouped variables year by year. Quite reassuringly, we find out that photo cameras have become lighter over the past decade, while pixels and storage of newer models have steadily increased.

gDate = df[df['Date'] > 1998].groupby('Date').mean()
dates = [str(s) for s in gDate.index]
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(20,5))
cols = ['b', 'r', 'g']
vars = ['EffPix', 'Weight', 'Storage']
titles = ['effective pixels', 'weight', 'storage']
for i, var in enumerate(vars):
    gDate[[var]].plot(ax=axes[i], alpha=0.5, legend=False, lw=4, c=cols[i])
    axes[i].set_xticklabels(dates, rotation=40)
    axes[i].set_title('Evolution of %s\n' % titles[i])
plt.savefig('CameraEvolution.png', bbox_inches='tight')

[Figure: evolution of effective pixels, weight and storage by release year]

Grouping by maker instead, we can see which ones make the smallest and most economical models, on average.

gMak = df.groupby('Maker').median()
gMak.index.name = ''
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18,8))
c = ['y','c']
vars = ['Dimensions', 'Price']
for i, var in enumerate(vars):
    gMak[[var]].plot(kind='barh', ax=axes[i], alpha=0.5, legend=False, color=c[i])
    axes[i].set_title('Average %s by maker\n' % vars[i])
    axes[i].grid(False)
plt.savefig('MeanDimensionsPrices.png', bbox_inches='tight')

[Figure: median dimensions and prices by camera maker]

Pandas provides many more functionalities, and this is just a first look at them here at The Data Science Lab. We will continue the exploration of this and other datasets in due time. In the meantime, a quick Google search for pandas should be enough to keep us entertained.