# Machine learning

## k-means clustering

### Read on $k$-means, its implementation, drawbacks and possible solutions, and an example of its applications below.

Click the plot area to add/remove points or upload data from a CSV file:

### Introduction

One of the simplest things you can do to make sense out of a lump of data is to divide it into smaller lumps (called clusters), grouping similar datapoints together. This process is called clustering and is an example of unsupervised learning. After dividing your data into clusters, you can label them (e.g. “Category A”, “Category B”, etc.) and start thinking about which features are characteristic of which category. This, in turn, is called supervised learning, the second cornerstone of the insanely (as of 2016) fashionable intersection of statistics, computer science and artificial intelligence, known as machine learning. Below, we describe in simple terms a very popular clustering algorithm called $k$-means.

### Algorithm

Let's try to keep mathematical notation to the minimum, but some is still necessary. The data being clusterised is a set $S$ of $N$ points, each described by $D$ coordinates. The $i$-th datapoint is denoted by $\vec{x}_i$ ($0 \le i < N$) and its $j$-th coefficient by $x_{i,j}$ ($0 \le j < D$). We determine how close or far away are two points $\vec{x}_i$ and $\vec{x}_{i'}$ by calculating their Euclidean distance $d(\vec{x}_i, \vec{x}_{i'}) := \sqrt{\sum_{j=0}^{D-1} (x_{i,j} - x_{i',j})^2} \ .$ Our aim is to divide the dataset $S$ into $K$ clusters, where usually $K \ll N$. The $k$-th cluster is denoted by $C_k$. An important characteristic of a cluster is the average location of all its points, called the centroid: \begin{equation} \label{eq:centroid} \vec{c}_k := \frac{1}{\left| C_k \right|} \sum_{\vec{x} \in C_k} \vec{x} \ , \end{equation} where $\left| C_k \right|$ is the number of points in cluster $C_k$. The centroid is, roughly speaking, the location of the cluster's “centre of mass”.

The aim of the $k$-means algorithm is to divide the dataset in such way that every point belonging to cluster $C_k$ lies closer to its centroid than to any other one. If this is true, we congratulate ourselves of a job well done and go home. If not, we reassign the points to the clusters with centroids nearest to them: $\vec{x}_i \rightarrow \argmin_{0 \le k < K} \ d(\vec{x}_i, \vec{c}_k) \ .$ The ties (points lying at equal minimal distance $d$ to more than one centroid) should be broken randomly. After reassigning the points, we recalculate the centroids according to Eq. \eqref{eq:centroid} and check how much they have moved from their previous positions. These two steps (reassign and recalculate) should be repeated until convergence is reached, i.e. the centroids stop moving or move very little.

In our description above we ignored one important question: how to initialise the centroid values? There are many different ways. Below we will describe just three of them.

The Forgy algorithm chooses $K$ random points from the dataset and uses them as centroids. Obviously, if we select two points close to each other as centroids, this is not going to lead to a nice result, but this is not going to happen too often. It is a decent method if you want to solve standard clustering problems without much coding effort.

This approach assigns each point randomly to one of the $K$ clusters and then calculates the centroids using Eq. \eqref{eq:centroid}. In a typical case, it will lead to all centroids being close to each other in the centre of the dataset. Not a good choice, usually.

As the “++” suffix suggests, this method is a more serious affair. Like Forgy, it selects $K$ datapoints as initial centroids, but takes more care to space them wide apart. The first centroid is chosen entirely randomly (uniformly over $S$). After that, the $k$-th centroid is chosen using a probability distribution $P_k(\vec{x})$ with probabilities proportional to the square of the point's distance from the closest already chosen centroid: $P_k(\vec{x}) \sim \min_{0 \le l < k} d^2 (\vec{x}, \vec{c}_l) \ .$ This method tends to produce initial centroid locations more spread out than Forgy.

### How many clusters?

The $k$-means algorithm described above assumes that the number of clusters $K$ is given in advance, and we just need to find their locations. However, we often don't know how many clusters there are in the data either. A statistically sound way to answer this question is using the so-called gap statistic, described below.

For a given number of clusters $K$, the $k$-means algorithm minimizes the within-cluster dispersion (also called within-cluster sum of squares), defined as $\sum_{k=0}^{K-1} \sum_{\vec{x} \in C_k} d^2 (\vec{x}, \vec{c}_k) \ .$ Its minimum over all possible partitions of the dataset $S$ into $K$ clusters is denoted by $W_K$ (this statistic is also a function of the dataset $S$, but we do not reflect that in the notation to keep it simpler). $W_K$ is a monotonically decreasing function of $K$, which means that the more clusters we have, the closer every datapoint lies to its centroid.

Traditionally, a non-rigorous method of determining the appropriate number of clusters was to plot $W_K$ and choose the value of $K$ over which $W_K$ flattens out at the optimal cluster number $\Kopt$, the so-called “elbow point”. An example of such a procedure is presented in the neighbouring graph.

The shape of the curve $W_K$ strongly suggests that the optimal cluster number $\Kopt=2$, but how can we tell this without looking at the plot? The gap statistic is a more refined and rigorous take on this simple heuristic method.

The fact that $W_K$ decreases monotonically by construction suggests that we can try to “detrend”. To this end, we compare $W_K$ with a “reference” value $W'_K$ calculated for a sample of size $N$ drawn from a known simple reference distribution $\mathcal{R}$, which does not to exhibit any clustering. The latter is usually assumed to be a uniform distribution over the minimal hypercube encapsulating the data. If we want to adjust it better to our dataset, we can orient the hypercube along the eigenvectors of its covariance matrix. Just as $W_K$, $W'_K$ decreases monotonically with $K$. However, as we increase $K$ to bring it closer to the optimal value $\Kopt$, $W_K$ decreases faster than $W'_K$ because at least some of the datapoints $\vec{x} \in S$ will be assigned to their “correct” centroids, decreasing their distances from them by more than in the case of the reference distribution. The below graphs illustrate this effect. The within-cluster dispersion decreases with the increase of $K$ from 1 to 2 for clustered dataset $S$ and reference distribution $\mathcal{R}$ sample. Dashed lines link datapoints to their cluster's centroids (their colours correspond to cluster assignments). We obtain $\Delta W'_1 - \Delta W_1 \approx 0.84 > 0$, which is consistent with the fact that $\Kopt = 2$.  To remove the effects of distance scale (e.g. whether we measure the distance between datapoints in miles or millimetres), it is better to compare the changes in $\ln W_K$ and $\ln W'_K$, expecting $\Delta \ln W'_K - \Delta \ln W_K > 0$ when $K + 1$ is closer to $\Kopt$ than $K$. Another motivation for the logarithm is the fact that for data sampled from a mixture of Gaussian distributions, $-\ln W_K$ is proportional to log-likelihood.

Since $\ln W'_K$ is calculated from a random sample, it is a random variable itself - it has a non-zero variance. Hence, we need to make sure that we do not change our estimate of $\Kopt$ from $K$ to $K+1$ without a good reason. For this, we require not only that $\ln W_K$ decreases faster than $\ln W'_K$ for a single sample from $\mathcal{R}$, or that it is likely to decrease faster. We ask that it is unlikely not to decrease faster. Thus, we define the gap statistic announced at the beginning of this section as $G_K := E_{\mathcal{R}}[\ln W'_K] - \ln W_K \ ,$ where $E_{\mathcal{R}}[\ln W'_K]$ is the expected value of $\ln W'_K$ under distribution $\mathcal{R}$. Starting from $K=1$, we accept $K+1$ over $K$ only if $G_K < G_{K+1} - \sqrt{\Var_{\mathcal{R}}[\ln W'_K]} \ ,$ where $\Var_{\mathcal{R}} [\ln W'_K]$ is the variance of $\ln W'_K$ under $\mathcal{R}$. The optimal cluster number $\Kopt$ is assumed to be the first one for which this inequality does not hold.

In practice, we need to calculate the expected value and variance using Monte Carlo methods. We approximate $E_{\mathcal{R}}[\Delta W'_K]$ by an average $\bar{\ln W'_K}$ taken over $B$ samples from $\mathcal{R}$ (a typical value for $B$ is 100-300), and $\Var_{\mathcal{R}} [\ln W'_K]$ by the corresponding sample variance $\sigma^2(\ln W'_K)$, multiplied by $1 + 1 / B$ to account for the sampling error.

### Timorous beasties

Although the $k$-means clustering gained popularity thanks to its simple and intuitive idea, it has its shortcomings. They are exposed by the below “mouse” example. The artificial dataset consists of three circular clusters of different sizes. Since the $k$-means method splits datapoints halfway between clusters, and hence favours clusters of similar sizes, it fails to find an acceptable solution. The problem can be remedied if instead of looking at individual datapoints and their mutual distances, we focus on clusters. Each of them is characterised by an area and the density of datapoints it contains. Assigning datapoints to clusters should cause a minimum increase of the cluster area and a minimum decrease of its density. This criterion leads to a new clustering method, $k$-dense clustering (to be published soon).

Agnieszka Werpachowska, 15 November 2017.