WiseAnalytics | ON Coding K-Means in Vanilla Python

Insights

ON Coding K-Means in Vanilla Python

4 min read

By Julien Kervizic

K-means is an algorithm to assign clusters to different points. One of its’ application is to provide an automated segmentation and micro-segmentation method based on behavioral data. These segments can then used to provide recommendation or targeted offers to the different users’ within the segments such as previously described in a previous blog post.

The algorithm takes two main inputs, a list of points and a fixed number of clusters and provides two different outputs, a cluster assignment for each point, a cluster center for each of the computed clusters. This blog post shows how to implement this algorithm from scratch using vanilla python.

Sklearn API

If we look at the Sklearn API as how it sets up the inputs for K-means:

K-Means is instantiated by providing the input number of clusters to compute, and the training is being computed when calling the fit method on a list of points, numpy array or pandas data-frame.

The output of K-means with sklearn are a predict method for cluster assignment and a cluster_centers_ variable in order to obtain the center point of each computed clusters.

Class Skeleton

if we were to re-apply these API calls as a framework to our vanilla python implementation we would have the skeleton of the class as:

Points Classes and Functions

Above is represented a potential point class, it is used to store the methods and data of every-point meant to be included in the k-means computation. The class makes heavy use of python operators in order to simplify some of the operations that will occur downstream, such as the computation of means, of distances or of equality between two points.

A Cluster point is defined as a point assignable to a given cluster.

The k-means algorithms relies in a notion of distance in order to tie each point to its closest mean-point. The distance function shown above is a calculates the euclidean distance between two 2D points. While the closest function return the indice of the first point in a list of points to be the point the nearest to a given point p.

Fit Method

Let’s first focus on what the fit method should do, if we look at how the k-means algorithm is meant to behave:

Assignment Step: It takes a new point at random for each of the n clusters that have been defined in k-means and assigns it to that cluster
Update Step: For each of the points it assigns a point to a cluster based on it being the closest cluster and that until a certain stopping criteria is being met, usually a number of iteration or a convergence condition (the cluster means don’t change)

Let’s look at how to implement the assignment step within the fit method:

What the code does is that until we have assigned the required number of initial cluster points, we assign to a cluster one point at random to be its’ center. These end up being push to the variable cluster_centers_ for further use. We then initialize every point by assigning it to its closest cluster point.

For all the point in the data set, do a first pass initialization and assign the point to a given cluster.

Assign each point to its closest cluster mean (lcp)
Recalculate the mean of each clusters (cluster_centers_), in the above function the mean is calculated using list comprehension filtering and the operators of the point class
Break if the stop condition is met otherwise go for another round of the loop. The stop condition being put there is either 100 iterations or a constant mean and 1 or more iteration.

Setting up everything together gives the above fit method.

Predict Method

The predict functions re-uses part of the code that was used during the fit/training process. It converts each point in a list of point into a cluster point assigned to its’s nearest cluster mean.

Wrapping up

Setting up K-Means as describe above allows us to run the fit method on a List of points and compute the different cluster centers.

The predict API when provided with a list of point inputs, returns a list of cluster points with an assignment to their closest k-means.

We were able to create an implementation of k-means using vanilla python modeled after the sklearn api, while it may not show the performance of an implementation relying on numpy or pandas and is unoptimized, it allows for an understanding of the different steps required to implement the K-Means algorithm.

‍

Sitemap

Cookie Preferences