0) Motivation, Objectives and Related works:
Motivation:
Address the challenging problem of clustering face tracks based on their identity.
Choose to operate in a realistic and difficult setting where:
(i) the number of characters is not known a priori;
(ii) face tracks belonging to minor or background characters are not discarded.
Objectives:
Propose Ball Cluster Learning (BCL)
A supervised approach to carve the embedding space into balls of equal size, one for each cluster. The learned ball radius is easily translated to a stopping criterion for iterative merging algorithms.
This gives BCL the ability to estimate the number of clusters as well as their assignment, achieving promising results on commonly used datasets.
Also present a thorough discussion of how existing metric learning literature can be adapted for this task.
In this paper, we propose Ball Cluster Learning (BCL) - a supervised approach to carve the embedding space into equal-sized balls such that samples within a ball belong to one cluster. In particular, we formulate learning constraints that create such a space and show how the ball radius (also learned) can be associated with the stopping criterion for agglomerative clustering to estimate both the number of clusters and assignment (Sec. 3). We demonstrate BCL on video face clustering in a setup where we are unaware of the number of characters, and all face tracks, main character or otherwise, are included (Sec. 4). Thus, BCL is truly applicable to all videos as it does not place assumptions on availability of cast lists (to determine the number of clusters) or track labels (to discard background characters). To evaluate our approach, we augment standard datasets used in video face clustering by resolving labels between all background characters. Our approach achieves promising results in estimating the number of clusters and the cluster assignment. We also present a thorough analysis of commonly used loss functions in verification (e.g. contrastive loss), compare them against BCL, and discuss how and when they may be suitable for clustering. To the best of our knowledge, BCL is the first approach that learns a threshold to estimate the number of clusters at test time. Code and data are available on GitHub.
We presented Ball Cluster Learning - a supervised approach to carve the representation space into balls of an equal radius. We showed how the radius is related to the stopping criterion used in agglomerative clustering methods, and evaluated this approach for clustering face tracks in videos. In particular, we considered a realistic setup where the number of clusters is not known, and tracks from all characters (main or otherwise) are included. We reviewed several metric learning approaches and adapted them to this clustering setup. BCL shows promising results, and to the best of our knowledge is the first approach that learns a threshold that can be used directly to estimate the number of clusters.
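The key mechanism described above is that a learned ball radius doubles as a stopping criterion for agglomerative merging, so the number of clusters falls out of the threshold rather than being given. A minimal sketch of that idea, assuming SciPy's complete-linkage hierarchical clustering; the toy embeddings, the radius value, and the choice of 2*r as the merge threshold are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D embeddings forming three tight groups; in BCL these would be
# learned face-track features living inside balls of a learned radius.
emb = np.array([[0.0, 0.0], [0.1, 0.0],
                [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0]])

learned_radius = 0.5            # stand-in for the radius learned during training
threshold = 2 * learned_radius  # assumption: stop merging once clusters are
                                # farther apart than a ball diameter

Z = linkage(emb, method='complete')               # iterative agglomerative merging
labels = fcluster(Z, t=threshold, criterion='distance')
num_clusters = len(set(labels))                   # estimated, not given a priori
```

Because the threshold is fixed by the learned radius, the same clustering procedure applies to any video without knowing the cast size in advance.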
Related Works:
Understanding characters also has a direct influence on important research such as video captioning [34, 35], question-answering [22, 44], studying social situations [45] and 4D effects [56].
Characters are often studied by analyzing face tracks (sequences of temporally related detections) in videos.
A significant part of this work is identification: labeling face tracks with their names, typically using supervision from web images [1, 29], transcripts [3, 9], or even dialogs [7, 15].
While there exists a large body of work on video face clustering (e.g. [6, 18, 55]), most of it assumes a simplified setup where background characters are ignored and the total number of characters is known. Progress in this area builds on recent advances in face representations [4], their application to clustering [38], and the ability to learn cast-specific metrics from overlapping face tracks [6].
We encourage the community to address the challenging problem of estimating the number of characters and not ignoring background cast (see Fig. 1).
1) Definitions:
K-Means repeats three steps. First, the number of clusters K must be fixed in advance; the K cluster representatives are called centroids.
1) (Re-)assign each data point to its nearest centroid by computing the Euclidean distance from every point to every centroid.
2) Recompute each centroid as the mean of its currently assigned data points, i.e. move the centroid to the center of its cluster.
3) Repeat from 1) until the convergence criterion is met. In my case, I track how far the centroids move: if, after an iteration, all centroids together have moved by less than 0.01, so basically nothing changes anymore, the convergence criterion is satisfied and the algorithm stops.
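The three steps above can be sketched as follows; this is a minimal illustration, where the toy data, the fixed initial centroids, and the helper name `kmeans` are my own assumptions (the initialization strategy is not specified above):

```python
import numpy as np

def kmeans(points, k, init_centroids, tol=0.01, max_iter=100):
    """Plain K-Means following the three steps above. Stops when the
    summed distance moved by all centroids in one iteration is below tol."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # 1) Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2) Move each centroid to the mean of its assigned points
        #    (a centroid with no assigned points stays where it is).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 3) Convergence: all centroids together moved less than tol.
        moved = np.linalg.norm(new_centroids - centroids, axis=1).sum()
        centroids = new_centroids
        if moved < tol:
            break
    return labels, centroids

# Toy usage: two well-separated groups of 2-D points, with hand-picked
# initial centroids so the run is deterministic.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(points, k=2,
                           init_centroids=[[0.5, 0.5], [9.0, 9.0]])
```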
References: