0) Motivation, Objectives and Related works:
Objectives:
In this post, I will cover recent clustering techniques that leverage deep learning.
The goal of most of these techniques is to cluster the data points so that points belonging to the same ground-truth class are assigned to the same cluster.
Deep learning-based clustering techniques differ from traditional clustering in that they group data points by learning complex, non-linear patterns rather than relying on simple pre-defined metrics such as intra-cluster Euclidean distance (a minimal pipeline sketch follows below).
Non-parametric deep clustering: methods that apply deep clustering when the number of clusters is not known a priori and must be inferred.
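To make the distinction above concrete, here is a minimal sketch of the generic pipeline most deep clustering methods build on: learn a non-linear embedding (here with a small autoencoder) and then cluster in that embedding space. The architecture, training loop, and hyper-parameters are illustrative assumptions, not any specific paper's method.

```python
# Generic deep clustering sketch: embed, then cluster (illustrative only).
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 500), nn.ReLU(),
                                     nn.Linear(500, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 500), nn.ReLU(),
                                     nn.Linear(500, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def deep_cluster(x, n_clusters=10, epochs=50, lr=1e-3):
    """x: (N, 784) float tensor, e.g. flattened MNIST images scaled to [0, 1]."""
    model = AutoEncoder(in_dim=x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # reconstruction pre-training (full batch for brevity)
        opt.zero_grad()
        recon, _ = model(x)
        loss = nn.functional.mse_loss(recon, x)
        loss.backward()
        opt.step()
    with torch.no_grad():                        # cluster in the learned latent space
        _, z = model(x)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z.numpy())
```

Most of the methods discussed later refine this recipe, e.g., by training the embedding and the cluster assignments jointly instead of running k-means as a post-hoc step.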
1) Datasets and Metrics:
Datasets:
MNIST: Consists of 70,000 images of hand-written digits at 28 × 28 pixels. The digits are centered and size-normalized (LeCun, 1998).
[USPS] J. J. Hull. "A database for handwritten text recognition research." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 5, pp. 550-554, 1994.
[Fashion-MNIST] H. Xiao, K. Rasul, R. Vollgraf. "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms." arXiv preprint arXiv:1708.07747, 2017.
[COIL20]: Contains 1,440 grayscale images (32 × 32) of 20 objects; for each object, 72 images were taken at 5-degree rotation intervals (Nene et al., 1996).
[STL10] A. Coates, A. Ng, and H. Lee. "An analysis of single-layer networks in unsupervised feature learning." In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, AISTATS, 2011.
[Reuter10k] David D. Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li. "RCV1: A new benchmark collection for text categorization research." JMLR, 2004.
[ImageNet] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "ImageNet: A large-scale hierarchical image database." In CVPR, 2009.
COIL-20 http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
CMU PIE http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html
Yale-B http://vision.ucsd.edu/~leekc/ExtYaleDatabase/Yale%20Face%20Database.htm
MNIST http://yann.lecun.com/exdb/mnist/index.html
CIFAR http://www.cs.toronto.edu/~kriz/cifar.html
STL-10
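Several of these benchmarks can also be pulled programmatically, for example via torchvision (an assumed dependency; the root path below is an arbitrary choice):

```python
# Load a few of the benchmark datasets listed above via torchvision (assumed installed).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
mnist = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)
fmnist = datasets.FashionMNIST(root="./data", train=True, download=True, transform=to_tensor)
stl10 = datasets.STL10(root="./data", split="unlabeled", download=True, transform=to_tensor)
print(len(mnist), mnist[0][0].shape)   # 60000 training images, torch.Size([1, 28, 28])
```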
Metrics:
Validation and Assessment:
Clustering Validation: Evaluate the goodness of the clustering
External - Supervised: Employ criteria not inherent to the dataset.
Internal - Unsupervised: Criteria are derived from the data itself.
Relative: Compare different clusterings, usually those obtained via different parameter settings for the same algorithm.
Clustering Stability: Understand the sensitivity of the clustering result to various algorithm parameters (e.g., to the number of clusters).
Clustering Tendency: Assess whether clustering is applicable at all (whether the data has any inherent grouping structure).
External - Supervised Metrics:
[ACC] Clustering accuracy - range [0, 1]; requires the best one-to-one matching between cluster ids and ground-truth classes (see the sketch after this list).
[NMI] Normalized mutual information - range [0, 1]
Strehl, A. and Ghosh, J. (2002). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617.
Vinh, N. X., Epps, J., and Bailey, J. (2010). Information-theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.
Cai, D., He, X., and Han, J. (2011). Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913.
[ARI] Adjusted Rand Index - range [-1, 1]
Purity: fraction of points assigned to the majority ground-truth class of their cluster.
Maximum Matching: like purity, but each ground-truth class may be matched to at most one cluster (one-to-one matching).
F-Measure: harmonic mean of per-cluster precision and recall against the best-matching class.
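Below are hedged sketches of ACC, purity, NMI, and ARI, assuming scikit-learn and SciPy are available; the toy label arrays at the bottom are illustrative. ACC uses the Hungarian algorithm (linear_sum_assignment) to find the best one-to-one mapping between cluster ids and classes, while purity simply credits each cluster with its majority class.

```python
# External metric sketches (assumes scikit-learn and SciPy).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one match between cluster ids and classes (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                                   # contingency table
    row, col = linear_sum_assignment(w.max() - w)      # maximize matched points
    return w[row, col].sum() / y_pred.size

def purity(y_true, y_pred):
    """Each cluster is credited with its most frequent ground-truth class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred)) / y_pred.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                            # same partition, permuted ids
print(clustering_accuracy(y_true, y_pred))             # 1.0
print(purity(y_true, y_pred))                          # 1.0
print(normalized_mutual_info_score(y_true, y_pred))    # 1.0
print(adjusted_rand_score(y_true, y_pred))             # 1.0
```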
Internal - Unsupervised Metrics:
[Sil] Silhouette Score - range [-1, 1]
Relative:
[Sil] Silhouette Score - range [-1, 1]
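The silhouette score needs only the data and the cluster assignments, so it serves both as an internal criterion and for relative comparison across runs, e.g., when sweeping the number of clusters. A minimal sketch, assuming scikit-learn and using synthetic toy data:

```python
# Internal/relative validation sketch: sweep k and compare silhouette scores.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)        # toy data stand-in
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))                 # typically peaks near the true k=4
```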
2) Methods:
Similarity/Distance Calculation Methods:
A similarity measure quantifies how alike two data points are; it is the counterpart of a distance (dissimilarity) measure, with higher values meaning more similar.
AKA: Affinity Measure, Relatedness Function.
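As a quick illustration (scikit-learn assumed; the RBF bandwidth gamma below is an arbitrary example value), the same set of points can be compared with a distance, a cosine similarity, or a Gaussian affinity:

```python
# Common pairwise distance/similarity/affinity choices.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity, rbf_kernel

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(euclidean_distances(X))     # dissimilarity: larger = less alike
print(cosine_similarity(X))       # similarity in [-1, 1], ignores magnitude
print(rbf_kernel(X, gamma=0.5))   # Gaussian affinity in (0, 1], larger = more alike
```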
The Measurement Indicators of Clustering:
External indicators are evaluated against reference information external to the data, such as ground-truth labels.
Internal indicators can be evaluated from the data itself, without any external reference.
References: