K-means is an algorithm to assign clusters to different points. One of its’ application is to provide an automated segmentation and micro-segmentation method based on behavioral data. These segments can then used to provide recommendation or targeted offers to the different users’ within the segments such as previously described in a previous blog post.
The algorithm takes two main inputs, a list of points and a fixed number of clusters and provides two different outputs, a cluster assignment for each point, a cluster center for each of the computed clusters. This blog post shows how to implement this algorithm from scratch using vanilla python.
If we look at the Sklearn API as how it sets up the inputs for K-means:
K-Means is instantiated by providing the input number of clusters to compute, and the training is being computed when calling the fit method on a list of points, numpy array or pandas data-frame.
The output of K-means with sklearn are a predict method for cluster assignment and a cluster_centers_ variable in order to obtain the center point of each computed clusters.
if we were to re-apply these API calls as a framework to our vanilla python implementation we would have the skeleton of the class as:
Above is represented a potential point class, it is used to store the methods and data of every-point meant to be included in the k-means computation. The class makes heavy use of python operators in order to simplify some of the operations that will occur downstream, such as the computation of means, of distances or of equality between two points.
A Cluster point is defined as a point assignable to a given cluster.
The k-means algorithms relies in a notion of distance in order to tie each point to its closest mean-point. The distance function shown above is a calculates the euclidean distance between two 2D points. While the closest function return the indice of the first point in a list of points to be the point the nearest to a given point p.
Let’s first focus on what the fit method should do, if we look at how the k-means algorithm is meant to behave:
Let’s look at how to implement the assignment step within the fit method:
What the code does is that until we have assigned the required number of initial cluster points, we assign to a cluster one point at random to be its’ center. These end up being push to the variable cluster_centers_ for further use. We then initialize every point by assigning it to its closest cluster point.
For all the point in the data set, do a first pass initialization and assign the point to a given cluster.
Setting up everything together gives the above fit method.
The predict functions re-uses part of the code that was used during the fit/training process. It converts each point in a list of point into a cluster point assigned to its’s nearest cluster mean.
Setting up K-Means as describe above allows us to run the fit method on a List of points and compute the different cluster centers.
The predict API when provided with a list of point inputs, returns a list of cluster points with an assignment to their closest k-means.
We were able to create an implementation of k-means using vanilla python modeled after the sklearn api, while it may not show the performance of an implementation relying on numpy or pandas and is unoptimized, it allows for an understanding of the different steps required to implement the K-Means algorithm.