[CaiT] Going deeper with image Transformers
{Knowledge Distillation}
Paper: https://arxiv.org/abs/2103.17239
0) Motivation, Objectives and Related works:
Motivation:
Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks.
However, the optimization of image transformers has been little studied so far.
Objectives: Class-attention in image Transformer (CaiT)
Build and optimize deeper transformer networks for image classification.
In particular, we investigate the interplay of architecture and optimization of such dedicated transformers.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
This leads us to produce models whose performance does not saturate early with more depth.
CaiT is able to train on the ImageNet-1k dataset while achieving competitive performance.
Differences from ViT:
It utilizes a deeper Transformer, which aims to improve the representational power of features.
A technique called LayerScale is proposed to facilitate the convergence of training the deeper Transformer.
LayerScale introduces a learnable, per-channel scaling factor, which is inserted after each attention module to stabilize the training of deeper layers.
This technique allows CaiT to gain benefit from using the deeper Transformer, while there is no evidence of improvement when increasing the depth in ViT or DeiT.
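The per-channel scaling described above can be sketched as follows. This is a minimal, framework-agnostic NumPy sketch (not the paper's implementation); the function name `layer_scale_residual` and the shapes are illustrative. The key point is that the learnable scale `gamma` is initialized to a small value, so each residual block starts close to the identity, which stabilizes training of deep stacks:

```python
import numpy as np

def layer_scale_residual(x, branch_out, gamma):
    """LayerScale sketch: per-channel scaling of a residual branch.

    x, branch_out: (num_tokens, dim) residual stream and the output of
    an attention (or FFN) block; gamma: (dim,) learnable per-channel scale.
    """
    return x + gamma * branch_out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))        # 5 tokens, embedding dim 8
branch = rng.standard_normal((5, 8))   # e.g. output of a self-attention block
gamma = np.full(8, 1e-4)               # small init: block starts near-identity
y = layer_scale_residual(x, branch, gamma)
```

In a trainable implementation, `gamma` would be a learnable parameter (one per residual branch), updated jointly with the rest of the network.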
CaiT applies different types of attention at different stages of the network:
The normal self-attention (SA) in the early stage
The class-attention (CA) in the later stage.
The reason is to separate two tasks with contradictory objectives: guiding the self-attention among patch tokens, and summarizing the information useful to the classifier.
CaiT architecture:
The class token is inserted after the first stage, which is different from ViT.
This allows the SA to focus on associating the patch tokens with each other, without needing to summarize the information for classification.
Once the class token is inserted, the CA then integrates all the information into it to build a representation useful for the classification step.
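The class-attention step above can be sketched as a single-head attention where only the class token issues a query, while keys and values come from the class token concatenated with the frozen patch tokens. This is a minimal NumPy sketch under assumed shapes (the projection matrices `Wq`, `Wk`, `Wv` are illustrative single-head weights, not the paper's multi-head implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_attention(cls_tok, patches, Wq, Wk, Wv):
    """Class-attention sketch: only the class token attends.

    cls_tok: (1, d) class token; patches: (N, d) patch tokens.
    Returns the updated class token (1, d): the query comes from the
    class token alone, keys/values from [class token; patch tokens].
    """
    z = np.concatenate([cls_tok, patches], axis=0)  # (1 + N, d)
    q = cls_tok @ Wq                                # (1, d)
    k = z @ Wk                                      # (1 + N, d)
    v = z @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (1, 1 + N)
    return attn @ v                                 # (1, d)

rng = np.random.default_rng(1)
d, N = 8, 4
cls_tok = rng.standard_normal((1, d))
patches = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = class_attention(cls_tok, patches, Wq, Wk, Wv)
```

Because the patch tokens are not updated by CA, these layers are cheap relative to full self-attention: the attention map is (1, 1+N) instead of (N, N).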
References:
https://sertiscorp.medium.com/vision-transformers-a-review-part-ii-a31136cf848d