When Parametric Methods Break. We study the effect of not knowing K on parametric methods, with and without class imbalance. We evaluate each method with a wide range of K values on ImageNet-50. The latter, curated in [64], consists of 50 randomly-selected classes of ImageNet [17]. To generate an imbalanced version of it, we sampled a normalized nonuniform histogram from a uniform distribution over the 50-dimensional probability simplex (i.e., all histograms were equally probable) and then sampled examples from the 50 classes in proportions according to that nonuniform histogram. We compared with 3 parametric methods: 1) K-means; 2) SCAN [64], the SOTA method; 3) an improved version of DCN [71], which we coin DCN++, where instead of training an AE on the raw data we trained it on top of the embeddings that SCAN uses (MoCo [13]); following [64], these embeddings were frozen during training. For DeepDPM, we used the same features.
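The imbalanced-sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code; the total sample budget is an assumption. Sampling from Dirichlet(1, ..., 1) is equivalent to sampling uniformly over the probability simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 50

# Dirichlet with all concentration parameters equal to 1 is the uniform
# distribution over the probability simplex: every normalized histogram
# is equally probable.
proportions = rng.dirichlet(np.ones(num_classes))

# Draw per-class example counts in proportion to the sampled histogram.
# total_examples is an illustrative budget, not the paper's exact number.
total_examples = 60_000
counts = rng.multinomial(total_examples, proportions)
```

One would then subsample `counts[c]` examples from each class `c` of the balanced dataset.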
Since SCAN requires large amounts of memory (e.g., we could only run it on 2 RTX-3090 GPU cards with 24GB memory each, compared with DeepDPM, for which a single RTX-2080 (or even GTX-1080) with 8GB sufficed), and due to resource constraints, we were limited in how many K values we could run SCAN with and in the number of times each experiment could run (this high computational cost is one of the problems with model selection in parametric methods). Thus, we collected the results of the parametric methods with K values ranging from 5 to 350. For both the balanced and imbalanced cases, we initialized DeepDPM with K = 10. Figure 1 summarizes the ACC results (see Supmat for ARI/NMI). As the K value used by the parametric methods diverges from the GT (i.e., K = 50), their results deteriorate. Unsurprisingly, when using the GT K, or a value sufficiently close to it, the parametric methods outperform our nonparametric one, confirming our claim that having a good estimate of K is important for good clustering. Figure 1a, however, shows that even fairly moderate deviations from the GT K suffice for DeepDPM's result (0.66 ± .01) to surpass that of the leading parametric method. Moreover, Figure 1 shows that the parametric SCAN is sensitive to class imbalance; e.g., in Figure 1b, SCAN performs best with K = 30 (rather than the GT K = 50), suggesting it ignores many of the small classes. In contrast, DeepDPM (scoring 0.60 ± .01) is fairly robust to these changes, and its results are comparable to SCAN's when the latter was given the GT K. In addition, Table 4 shows the performance of other nonparametric methods (3 runs on the same features as ours: MoCo+AE). We include DeepDPM's results with alternation (between clustering and feature learning) and without (i.e., holding the features frozen and training DeepDPM only once). Table 5 compares the K values found by the nonparametric methods. DeepDPM inferred a K value close to the GT in both the balanced and imbalanced cases.
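For concreteness, the ACC metric reported above is the standard clustering accuracy: predicted clusters are matched one-to-one to ground-truth labels by the Hungarian algorithm before computing accuracy. A minimal sketch (assuming SciPy is available; the function name is ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one matching between
    predicted clusters and ground-truth labels (Hungarian algorithm)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    d = int(max(y_pred.max(), y_true.max())) + 1
    # contingency[p, t] counts points assigned to cluster p with true label t
    contingency = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / y_true.size
```

A permuted-but-correct labeling thus still scores ACC = 1.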
In the imbalanced case, moVB inferred a slightly more accurate K, but its clustering results (see Table 4) were worse. For the parametric methods, Table 5 also shows the K value achieving the best silhouette score. The unsupervised silhouette metric is commonly used for model selection (NMI/ACC/ARI are supervised, hence inapplicable for model selection). As Table 5 shows, DeepDPM yielded a more accurate K than that approach did.
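The silhouette-based model selection used for the parametric baselines can be sketched as follows (a minimal illustration assuming scikit-learn; the function name and candidate grid are ours): fit the parametric method for each candidate K and keep the K whose clustering maximizes the silhouette score, which requires no labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_k_by_silhouette(X, k_values, seed=0):
    """Fit K-means for each candidate K and return the K (and score)
    that maximizes the unsupervised silhouette score."""
    best_k, best_score = None, -np.inf
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

Note that this requires one full training run per candidate K, which is exactly the high computational cost of parametric model selection discussed above, and which a nonparametric method such as DeepDPM avoids by inferring K in a single run.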