[DINO] Emerging Properties in Self-Supervised Vision Transformers
Facebook AI Research_ICCV_2021
{Distillation, Teacher-Student}
A self-supervised method is applied to Vision Transformer (ViT), forming DINO, a form of self-DIstillation with NO labels.
The self-supervised ViT features are found to contain explicit information about the semantic segmentation of an image, as shown above, and the extracted features are also excellent k-NN classifiers.
DINO directly predicts the output of a teacher network - built with a momentum encoder - using a standard cross-entropy loss.
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) [19] that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.
Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder [33], multi-crop training [10], and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
Figure. Self-attention from a Vision Transformer with 8×8 patches trained with no supervision.
DINO is illustrated in the case of one single pair of views for simplicity.
The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters.
The output of the teacher network is centered with a mean computed over the batch.
Each network outputs a K dimensional feature normalized with a temperature softmax over the feature dimension. Their similarity is then measured with a cross-entropy loss. A stop-gradient (sg) operator is applied to the teacher to propagate gradients only through the student. The teacher parameters are updated with an exponential moving average (ema) of the student parameters.
The model passes two different random transformations of an input image to the student network gθs and teacher network gθt (both have the same architecture but different parameters).
The output of the teacher network is centered with a mean computed over the batch.
Each network outputs a K dimensional feature, which is normalized with a temperature softmax τs over the feature dimension to give the output probability distributions Ps and Pt:
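P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)}
and analogously for Pt with the teacher temperature τt.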
With a fixed teacher, their similarity is then measured with a cross-entropy loss:
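\min_{\theta_s} H\left(P_t(x), P_s(x)\right), \quad \text{where } H(a, b) = -a \log b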
More precisely, from a given image, a set V of different views is generated. This set contains two global views, xg1 and xg2 and several local views of smaller resolution.
All crops are passed through the student while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.
The loss is minimized:
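\min_{\theta_s} \sum_{x \in \{x^g_1, x^g_2\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} H\left(P_t(x), P_s(x')\right)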
In practice, the standard multi-crop setting is used: 2 global views at resolution 224² covering a large area (for example, greater than 50%) of the original image, and several local views at resolution 96² covering only small areas (for example, less than 50%) of the original image.
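As a rough illustration, a minimal multi-crop sketch with torchvision (the crop-scale thresholds below simply follow the 50% rule above; the official recipe also adds flips, color jittering and blurring, and its exact scale ranges may differ):

import torchvision.transforms as T

# Illustrative multi-crop sketch: 2 global views + several local views per image.
global_crop = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),   # covers a large area of the image
    T.ToTensor(),
])
local_crop = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.5)),   # covers a small area of the image
    T.ToTensor(),
])

def multi_crop(image, n_local=6):
    views = [global_crop(image), global_crop(image)]
    views += [local_crop(image) for _ in range(n_local)]
    return views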
ResNet and ViT Architectures
The ViT architecture takes as input a grid of non-overlapping contiguous image patches of resolution N×N. In this paper we typically use N=16 (“/16”) or N=8 (“/8”). The patches are then passed through a linear layer to form a set of embeddings. An extra learnable token is added to the sequence. The role of this token is to aggregate information from the entire sequence and the projection head h is attached at its output.
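For intuition, a minimal sketch of the patch embedding and the extra learnable token in PyTorch (simplified; a real ViT also adds positional embeddings and the Transformer blocks, and the dimensions here are only illustrative):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Splits the image into non-overlapping N x N patches and linearly embeds each patch,
    # then prepends an extra learnable token that aggregates information from the sequence.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        # A strided convolution is equivalent to a per-patch linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                               # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)     # (B, #patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)               # (B, #patches + 1, dim)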
ViT architectures do not use batch normalization (BN) by default. Therefore, when applying DINO to ViT, no BN is used in the projection heads either, making the system entirely BN-free.
Figure. Networks configuration. “Blocks” is the number of Transformer blocks, “dim” is channel dimension and “heads” is the number of heads in multi-head attention. “#tokens” is the length of the token sequence when considering 224² resolution inputs, “#params” is the total number of parameters (without counting the projection head) and “im/s” is the inference time.
Projection head h: g = h ∘ f.
The projection head consists of a 3-layer multi-layer perceptron (MLP) with hidden dimension 2048 followed by l2 normalization and a weight normalized fully connected layer (Weight Norm) with K dimensions, which is similar to the design from SwAV [9].
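A minimal sketch of such a head in PyTorch (the paper's output dimensionality is K = 65,536 and the hidden size of 2048 is from the text above; the 256-d bottleneck before the weight-normalized layer is an assumption borrowed from common implementations):

import torch.nn as nn
import torch.nn.functional as F

class DINOHead(nn.Module):
    # 3-layer MLP -> l2 normalization -> weight-normalized linear layer (no BN anywhere).
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.last_layer = nn.utils.weight_norm(nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)   # l2 normalization
        return self.last_layer(x)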
DINO does NOT use a predictor, resulting in the exact same architecture in both student and teacher networks.
EMA to Update Teacher for Avoiding Collapse.
Model collapse is avoided using only centering and sharpening of the momentum teacher outputs.
Centering prevents one dimension from dominating but encourages collapse to the uniform distribution, while sharpening has the opposite effect. Applying both operations balances their effects.
A stop-gradient (sg) operator is applied on the teacher to propagate gradients only through the student. The teacher parameters are updated with an exponential moving average (ema) of the student parameters.
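Concretely, the update rule is θt ← λ θt + (1 − λ) θs, with λ following a cosine schedule from 0.996 to 1 during training.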
The centering operation only depends on first-order batch statistics and can be interpreted as adding a bias term c to the teacher:
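g_t(x) \leftarrow g_t(x) + c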
The center c is updated with an exponential moving average (EMA), which allows the approach to work well across different batch sizes:
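c \leftarrow m\, c + (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i)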
where m>0 is a rate parameter and B is the batch size.
Output sharpening is obtained by using a low value for the temperature τt in the teacher softmax normalization.
k-NN evaluation uses a nearest neighbor classifier: the feature of an image is matched against the k nearest stored features, which vote for the label. 20-NN consistently works best for most of the runs.
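A minimal sketch of this evaluation in PyTorch, assuming frozen features have already been extracted for the train and test images (a plain majority vote is shown here; the paper's version weights each neighbor's vote by its similarity):

import torch
import torch.nn.functional as F

def knn_classify(test_feat, train_feats, train_labels, k=20, num_classes=1000):
    # Cosine similarity between the query feature and all stored training features.
    test_feat = F.normalize(test_feat, dim=-1)
    train_feats = F.normalize(train_feats, dim=-1)
    sims = train_feats @ test_feat                 # (N,)
    _, topk_idx = sims.topk(k)                     # indices of the k nearest stored features
    votes = torch.zeros(num_classes)
    for label in train_labels[topk_idx]:           # each neighbor votes for its label
        votes[label] += 1
    return votes.argmax().item()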
DINO performs on par with the state of the art on ResNet-50, validating that DINO works in the standard setting.
DINO is inspired by BYOL.
DINO does not use contrastive learning, but it shares the same strategy as SimCLR: creating paired views of an image by applying different data augmentations and comparing the outputs.