Determining The Optimal Number Of Clusters: 3 Must-Know Methods
{Elbow method; Average Silhouette method; Gap method}
{Elbow method; Average Silhouette method; Gap method}
0) Motivation, Objectives and Related works:
Motivation:
Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, which requires the user to specify the number of clusters k to be generated.
The optimal number of clusters is somehow subjective and depends on the method used for measuring similarities and the parameters used for partitioning.
Objectives:
Describe different methods for determining the optimal number of clusters for k-means, k-medoids (PAM) and hierarchical clustering.
Direct methods: consist of optimizing a criterion, such as the within-cluster sums of squares or the average silhouette. The corresponding methods are named elbow and silhouette methods, respectively.
Statistical testing methods: consists of comparing evidence against the null hypothesis. An example is the gap statistic.
1) Dataset and Metrics:
2) Methods:
Figure. Sample Results of All Methods [Source]
(1) Elbow Method:
Ideas:
The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized.
The total WSS measures the compactness of the clustering and we want it to be as small as possible.
==> The Elbow method looks at the total WSS as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t improve much better the total WSS.
Method:
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
For each k, calculate the total within-cluster sum of square (wss).
Plot the curve of wss according to the number of clusters k.
The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
(2) Average Silhouette Method:
Ideas:
It determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
Average silhouette method computes the average silhouette of observations for different values of k.
The optimal number of clusters k is the one that maximize the average silhouette over a range of possible values for k (Kaufman and Rousseeuw 1990).
[Paper]: Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis
Method:
Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
For each k, calculate the average silhouette of observations (avg.sil).
Plot the curve of avg.sil according to the number of clusters k.
The location of the maximum is considered as the appropriate number of clusters.
(3) Gap Statistic Method:
Ideas:
The gap statistic compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data.
The estimate of the optimal clusters will be a value that maximizes the gap statistic (i.e, that yields the largest gap statistic). This means that the clustering structure is far away from the random uniform distribution of points.
[Paper]: Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63(2), 411–423 (2001)
Method:
Cluster the observed data, varying the number of clusters from k = 1, …, kmax, and compute the corresponding total within intra-cluster variation Wk.
Generate B reference data sets with a random uniform distribution. Cluster each of these reference datasets with varying number of clusters k = 1, …, kmax, and compute the corresponding total within intra-cluster variation Wkb.
Compute the estimated gap statistic as the deviation of the observed Wk value from its expected value Wkb under the null hypothesis Gap(k) [Appendix 1]. Compute also the standard deviation of the statistics.
Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1: Gap(k) ≥ Gap(k + 1) − sk+1.
[Appendix 1] The null hypothesis:
References:
https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” Journal of Statistical Software 61: 1–36. http://www.jstatsoft.org/v61/i06/paper.
Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis.