When Parametric Methods Break. We study the effect of not knowing K on parametric methods, with and without class imbalance. We evaluate each method with a wide range of K values on ImageNet-50. The latter, curated in [64], consists of 50 randomly-selected classes of ImageNet [17]. To generate an imbalanced version of it, we sampled a normalized nonuniform histogram from a uniform distribution over the 50-dimensional probability simplex (i.e., all histograms were equally probable) and then sampled examples from the 50 classes in proportions according to that nonuniform histogram. We compared with 3 parametric methods: 1) K-means; 2) SCAN [64], the SOTA method; 3) an improved version of DCN [71], which we coin DCN++, where instead of training an AE on the raw data we trained it on top of the embeddings that SCAN uses (MoCo [13]); following [64], these embeddings were frozen during training. For DeepDPM, we used the same features.
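The imbalanced-sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code; the total sample budget is an assumption. Sampling from Dirichlet(1, ..., 1) is equivalent to sampling uniformly over the probability simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 50

# Dirichlet with all concentration parameters equal to 1 is the uniform
# distribution over the probability simplex: every normalized histogram
# is equally probable.
proportions = rng.dirichlet(np.ones(num_classes))

# Draw per-class example counts in proportion to the sampled histogram.
# total_examples is an illustrative budget, not the paper's exact number.
total_examples = 60_000
counts = rng.multinomial(total_examples, proportions)
```

One would then subsample `counts[c]` examples from each class `c` of the balanced dataset.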
Since SCAN requires large amounts of memory (e.g., we could only run it on 2 RTX-3090 GPU cards with 24GB memory each, compared with DeepDPM, for which a single RTX-2080 (or even GTX-1080) with 8GB sufficed), and due to resource constraints, we were limited in how many K values we could run SCAN with and in the number of times each experiment could run (this high computational cost is one of the problems with model selection in parametric methods). Thus, we collected the results of the parametric methods with K values ranging from 5 to 350. For both the balanced and imbalanced cases, we initialized DeepDPM with K = 10. Figure 1 summarizes the ACC results (see Supmat for ARI/NMI). As the K value used by the parametric methods diverges from the GT (i.e., K = 50), their results deteriorate. Unsurprisingly, when using the GT K, or a value sufficiently close to it, the parametric methods outperform our nonparametric one, confirming our claim that having a good estimate of K is important for good clustering. Figure 1a, however, shows that even fairly moderate deviations from the GT K suffice for DeepDPM's result (0.66 ± .01) to surpass that of the leading parametric method. Moreover, Figure 1 shows that the parametric SCAN is sensitive to class imbalance; e.g., in Figure 1b, SCAN performs best with K = 30 (rather than the GT K = 50), suggesting it ignores many of the small classes. In contrast, DeepDPM (scoring 0.60 ± .01) is fairly robust to these changes, and its results are comparable to SCAN's when the latter was given the GT K. In addition, Table 4 shows the performance of other nonparametric methods (3 runs on the same features as ours: MoCo+AE). We include DeepDPM's results with alternation (between clustering and feature learning) and without (i.e., holding the features frozen and training DeepDPM only once). Table 5 compares the K values found by the nonparametric methods. DeepDPM inferred a K value close to the GT in both the balanced and imbalanced cases.
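For concreteness, the ACC metric reported above is the standard clustering accuracy: predicted clusters are matched one-to-one to ground-truth labels by the Hungarian algorithm before computing accuracy. A minimal sketch (assuming SciPy is available; the function name is ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one matching between
    predicted clusters and ground-truth labels (Hungarian algorithm)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    d = int(max(y_pred.max(), y_true.max())) + 1
    # contingency[p, t] counts points assigned to cluster p with true label t
    contingency = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / y_true.size
```

A permuted-but-correct labeling thus still scores ACC = 1.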
In the imbalanced case, moVB inferred a slightly more accurate K, but its clustering results (see Table 4) were worse. For the parametric methods, Table 5 also shows the K value achieving the best silhouette score. The unsupervised silhouette metric is commonly used for model selection (NMI/ACC/ARI are supervised, hence inapplicable for model selection). As Table 5 shows, DeepDPM yielded a more accurate K than that approach did.
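The silhouette-based model selection used for the parametric baselines can be sketched as follows (a minimal illustration assuming scikit-learn; the function name and candidate grid are ours): fit the parametric method for each candidate K and keep the K whose clustering maximizes the silhouette score, which requires no labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_k_by_silhouette(X, k_values, seed=0):
    """Fit K-means for each candidate K and return the K (and score)
    that maximizes the unsupervised silhouette score."""
    best_k, best_score = None, -np.inf
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

Note that this requires one full training run per candidate K, which is exactly the high computational cost of parametric model selection discussed above, and which a nonparametric method such as DeepDPM avoids by inferring K in a single run.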