[SwAV] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
{Online Clustering}
Paper: https://arxiv.org/pdf/2006.09882.pdf
Code:
Unsupervised image representations have significantly reduced the gap with supervised pre-training, notably with the recent achievements of contrastive learning methods.
These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging.
Propose SwAV:
Takes advantage of contrastive methods without requiring pairwise comparisons to be computed.
Simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning.
Use a “swapped” prediction mechanism where we predict the code of a view from the representation of another view.
The method can be trained with large and small batches and can scale to unlimited amounts of data.
Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network.
Propose Multi-crop:
A new data augmentation strategy.
Uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements.
DeepCluster: performs clustering and training iteratively in an offline manner, where the network is trained to predict the cluster assignments, which serve as pseudo-labels. However, a large number of forward passes over the whole dataset is needed at each iteration.
The same approach as AMDIM (using only the last feature map), but instead of comparing the vectors directly against each other, they compute the similarity against a set of K precomputed codes.
In practice, this means that SwAV generates K clusters and, for each encoded vector, compares it against those clusters to learn new representations. This work can be viewed as mixing the ideas of AMDIM and Noise as Targets.
Firstly, the “codes” are obtained by assigning features to prototype vectors. (Prototype vectors are learned along with the ConvNet parameters by back-propagation.)
Then, a “swapped” prediction problem is solved wherein the codes obtained from one data-augmented view are predicted using the other view.
Given two image features zt and zs from two different augmentations of the same image, their codes qt and qs are computed by matching these features to a set of K prototypes {c1, …, cK}.
Then, a “swapped” prediction problem is set up with the following loss function:
L(zt, zs) = l(zt, qs) + l(zs, qt)
where the function l(z, q) measures the fit between features z and a code q.
Each image xn is transformed into an augmented view xnt.
The augmented view is mapped to a vector representation by fθ.
The feature is then projected to the unit sphere: znt = fθ(xnt) / ‖fθ(xnt)‖2.
A code qnt is then computed from this feature by mapping znt to a set of K trainable prototype vectors, {c1, …, cK}, where C is the matrix whose columns are c1, …, cK.
The Swapped Prediction Problem has two terms.
Each term is the cross-entropy loss between the code and the probability obtained by taking a softmax of the dot products of z and all prototypes in C:
l(zt, qs) = − Σk qs(k) log pt(k), with pt(k) = exp(zt⊤ck / τ) / Σk′ exp(zt⊤ck′ / τ)
where τ is a temperature parameter.
Taking this loss over all the images and pairs of data augmentations leads to the following loss function for the swapped prediction problem:
This loss function is jointly minimized for C and θ.
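The swapped prediction loss above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation; the function names (`l2_normalize`, `swapped_loss`) and the `(D, K)` layout of the prototype matrix are assumptions for the sketch.

```python
import numpy as np

def l2_normalize(z, axis=-1):
    """Project feature vectors onto the unit sphere."""
    return z / np.linalg.norm(z, axis=axis, keepdims=True)

def swapped_loss(z_t, z_s, q_t, q_s, C, tau=0.1):
    """Swapped prediction loss L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t):
    the code of one view is predicted from the features of the other.
    C is a (D, K) matrix whose columns are the prototype vectors."""
    def ce(z, q):
        # cross-entropy between code q and the softmax over prototype similarities
        logits = z @ C / tau
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        log_p = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        return -(q * log_p).sum(axis=-1).mean()
    return ce(z_t, q_s) + ce(z_s, q_t)
```

In a training loop, the gradient of this loss would flow into both the network parameters θ (through z) and the prototype matrix C, matching the joint minimization described above.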
Intuitively, as the prototypes C are used across different batches, SwAV clusters multiple instances to the prototypes.
Codes are computed using the prototypes C such that all the examples in a batch are equally partitioned by the prototypes. This equipartition constraint ensures that the codes for different images in a batch are distinct, thus preventing the trivial solution where every image has the same code.
Given B feature vectors Z = [z1, …, zB], we are interested in mapping them to the prototypes C = [c1, …, cK].
This mapping, or codes, is denoted by Q = [q1, …, qB], and Q is optimized to maximize the similarity between the features and the prototypes:
max over Q ∈ 𝒬 of Tr(Q⊤C⊤Z) + εH(Q)
where H(Q) = −Σij Qij log Qij is the entropy function and ε is a smoothness parameter.
The authors adapt the SeLa SSL approach, i.e. the Sinkhorn-Knopp algorithm for optimal transport, to work on minibatches by restricting the transportation polytope to the minibatch:
𝒬 = { Q ∈ ℝ+^{K×B} : Q1B = (1/K)1K, Q⊤1K = (1/B)1B }
where 1K denotes the vector of ones in dimension K. These constraints enforce that on average each prototype is selected at least B/K times in the batch.
These soft codes Q* are the solution of Eq. 3 over the set 𝒬 and take the form of a normalized exponential matrix:
Q* = Diag(u) exp(C⊤Z / ε) Diag(v)
where u and v are renormalization vectors of size K and B respectively, obtained with a small number of Sinkhorn-Knopp iterations.
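The renormalization can be sketched as alternating row and column rescalings. This is a simplified NumPy sketch under the assumption that `scores` holds the (K, B) similarity matrix C⊤Z; it is not the authors' exact implementation.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Compute soft codes Q* = Diag(u) exp(C^T Z / eps) Diag(v) with a few
    Sinkhorn-Knopp iterations. The iterations alternately rescale rows and
    columns so that Q 1_B ≈ 1_K / K and Q^T 1_K = 1_B / B, i.e. the batch
    is (approximately) equally partitioned across the K prototypes."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # each prototype row sums to 1 ...
        Q /= K                              # ... then to 1/K
        Q /= Q.sum(axis=0, keepdims=True)   # each sample column sums to 1 ...
        Q /= B                              # ... then to 1/B
    return Q
```

A smaller ε makes the assignments sharper (closer to hard assignments) but, as noted in the paper, too little smoothing can lead to trivial solutions, so a moderate ε with only a few iterations is used in practice.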
In practice, when working with small batches, features from the previous batches are used to augment the size of Z, and around 3K features are stored.
Only features from the last 15 batches are kept with a batch size of 256, while contrastive methods typically need to store the last 65K instances obtained from the last 250 batches.
Augmenting views with smaller images.
A multi-crop strategy is proposed where two standard resolution crops are used and V additional low resolution crops are used that cover only small parts of the image. Using low resolution images ensures only a small increase in the compute cost.
The loss is generalized to:
L(zt1, zt2, …, ztV+2) = Σ i∈{1,2} Σ v≠i l(ztv, qti)
Codes are only computed for the two full-resolution crops, since small crops cover only parts of the image and would degrade the quality of the assignments.
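The generalized loss can be sketched as a double loop over the two full-resolution codes and all views. A minimal NumPy sketch, with assumed names and array layouts (`features` as a list of V+2 arrays of shape (B, D), `codes` as the two (B, K) code matrices, C as the (D, K) prototype matrix):

```python
import numpy as np

def multicrop_swapped_loss(features, codes, C, tau=0.1):
    """Multi-crop swapped loss: codes come from the two full-resolution
    crops only (codes[0], codes[1]); the features of every other view are
    predicted against both, and a view is never predicted from itself."""
    def ce(z, q):
        # cross-entropy between code q and softmax over prototype similarities
        logits = z @ C / tau
        logits = logits - logits.max(axis=-1, keepdims=True)
        log_p = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        return -(q * log_p).sum(axis=-1).mean()
    total, n_terms = 0.0, 0
    for i in (0, 1):                 # indices of the full-resolution crops
        for v, z in enumerate(features):
            if v == i:
                continue             # skip predicting a view from itself
            total += ce(z, codes[i])
            n_terms += 1
    return total / n_terms
```

Because only the two full-resolution views go through the Sinkhorn code computation, adding V low-resolution crops increases the number of loss terms but not the cost of the assignment step.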
ResNet-50
Outperforms the state-of-the-art by +4.2% top-1 accuracy.
Only 1.2% below a fully supervised model.
Outperforms other self-supervised methods and is on par with state-of-the-art semi-supervised models.
Left: For linear classification performance on the Places205, VOC07, and iNaturalist2018 datasets, SwAV outperforms supervised features on all three datasets. Note that SwAV is the first self-supervised method to surpass ImageNet supervised features on these datasets.
Right: For object detection fine-tuning, SwAV outperforms the supervised pretrained model on both VOC07+12 and COCO datasets.
Because SwAV only stores a queue of 3,840 features, it maintains state-of-the-art performance even when trained in the small-batch setting.
SwAV learns much faster and reaches higher performance in 4× fewer epochs.
The multi-crop strategy consistently improves the performance of all considered methods by a significant margin of 2-4% top-1 accuracy.
Outperforms training from scratch by a significant margin.
Benefits from training on large architectures.
Advantage:
Solves the false-negative problem of standard contrastive learning methods.
With the standard approach, each sample is treated as its own class in the dataset. Therefore, a negative sample can still belong to the same class as the input image: a false negative.
Example: the input image is a cat and the other images in the batch act as negatives. The problem occurs when one or more images in that batch are also cats. The model is then forced to learn that two images of cats are different, even though they should belong to the same class. This can degrade the embedding vectors.
Disadvantage: