Unsupervised Learning - Clustering

Unsupervised machine learning is the machine learning task of inferring a function that describes the structure of "unlabeled" data (i.e. data that has not been classified or categorized). - Wikipedia

One important unsupervised learning algorithm is clustering, which finds separate groups of points (clusters) in the data. Applications of clustering include market segmentation (e.g. selling to different groups of customers separately), social network analysis (e.g. finding cohesive groups of people), organizing computing clusters (e.g. reorganizing resources, laying out the network, or designing data center communications), and astronomical data analysis (e.g. understanding galaxy data).

Common clustering algorithms include K-Means, Mean-Shift Clustering, DBSCAN, Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM), Agglomerative Hierarchical Clustering, etc. For now, I will mainly talk about K-Means and cover the other algorithms in the near future.


K-Means is an iterative algorithm that alternates between two steps: cluster assignment and moving the centroids. It simply iterates until the result converges.

Assume we have a two-dimensional dataset that we want to cluster into two separate sets. The steps are as follows:

  • 1. Randomly initialize two points, called the cluster centroids.
  • 2. Assign each data point to one of the centroids based on the Euclidean distance (or squared Euclidean distance) to each centroid, picking the closer one.
  • 3. Average (take the mean of) each cluster's points to get a new centroid.
  • 4. Reassign each data point to the closest of the new centroids.
  • 5. Iterate steps 3–4 until the centroids don't change anymore.
  • 6. If a centroid ends up with no points assigned, either eliminate it to keep K − 1 centroids or randomly reinitialize it. This doesn't happen very often in practice.
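The steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the function name `kmeans` and its parameters are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain K-Means on an (m, n) data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids at k distinct random training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs, this recovers the two groups after a few iterations.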

To describe K-Means generally:

The optimization objective:

K-Means minimizes a cost function (also called the distortion of K-Means):

J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ||x^(i) − μ_{c^(i)}||²

where c^(i) is the index of the cluster to which example x^(i) is currently assigned and μ_{c^(i)} is the centroid of that cluster. Tracking this cost mainly helps to debug the model and to avoid local optima.

The variables are partitioned into two sets: the assignments c and the centroids μ. The cluster-assignment step minimizes J with respect to c while keeping μ fixed, and the move-centroids step minimizes J with respect to μ while keeping c fixed.
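The distortion itself is cheap to compute. A small sketch (the function name `distortion` and the argument layout are my own; `X` is an (m, n) array, `labels` holds the assignments c, and `centroids` holds the μ's):

```python
import numpy as np

def distortion(X, labels, centroids):
    """Average squared Euclidean distance from each point
    to the centroid of its assigned cluster."""
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))
```

For two points at (0, 0) and (2, 0) both assigned to a centroid at (1, 0), each contributes a squared distance of 1, so the distortion is 1.0.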

What to remember when you initialize K-Means:

Randomly pick K (K < m) training examples, then set the centroids to these examples. Because of different initializations, you can get different solutions (different clusterings). To avoid local optima and get the best result, we run K-Means many times (i.e. randomly initialize many times, usually 50–1000 times) and then pick the solution with the lowest cost.
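The multiple-restart idea can be sketched as follows. This is self-contained for illustration; the helper names `run_once` and `kmeans_multi_init` are my own:

```python
import numpy as np

def run_once(X, k, rng, iters=100):
    """One K-Means run: centroids start at k distinct training examples."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment and distortion (cost) for this run.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    cost = np.mean(np.min(dists, axis=1) ** 2)
    return cost, centroids, labels

def kmeans_multi_init(X, k, n_init=50, seed=0):
    """Repeat K-Means from many random initializations and keep
    the solution with the lowest distortion."""
    rng = np.random.default_rng(seed)
    return min((run_once(X, k, rng) for _ in range(n_init)),
               key=lambda result: result[0])
```

Each restart draws a fresh random initialization from the same generator, so the whole procedure stays reproducible for a given seed.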

Usually, if K is small (say 2–10), initializing many times can make a difference. If K is relatively large (say 100), you will probably get a good result on the first run.

How to choose the number of clusters (K):

A common heuristic is the elbow method: plot the cost J against the number of clusters K and pick the K at the "elbow", where the decrease in cost flattens out. Usually, though, the curve decreases smoothly and there isn't a clear elbow. Then you need to consider the downstream purpose, i.e. choose the K that best serves the later application (e.g. for business, the segmentation that satisfies customers the most).
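When the curve does have an elbow, one crude way to locate it automatically is to compare successive drops in the cost. This is just one possible heuristic, not a standard rule; the function name `elbow_candidate` is hypothetical, and it assumes you have already computed the distortion for each candidate K:

```python
def elbow_candidate(ks, costs):
    """Pick the K after which the cost curve flattens the most:
    the K with the largest ratio of the drop before it to the
    drop after it (a rough heuristic, not a rule)."""
    drops = [costs[i] - costs[i + 1] for i in range(len(costs) - 1)]
    # Ratio of successive drops; the small floor avoids division by zero.
    ratios = [drops[i] / max(drops[i + 1], 1e-12)
              for i in range(len(drops) - 1)]
    return ks[ratios.index(max(ratios)) + 1]
```

For a cost curve like 100, 50, 10, 9, 8.5 over K = 1..5, the drop from K = 2 to K = 3 dwarfs the drops afterwards, so the heuristic picks K = 3. When the ratios are all similar, there is no elbow and the downstream criterion above applies.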