[T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
{Aggregating neighboring Tokens; deep-narrow structure backbone}
0) Motivation, Objectives and Related Works:
Motivation:
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification.
The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification.
However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. The authors find this is because:
The simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency;
The redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples.
Objectives:
Propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates:
A layer-wise Tokens-to-Token (T2T) transformation that progressively structures the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced;
An efficient backbone with a deep-narrow structure for the vision transformer, motivated by CNN architecture design and chosen after an empirical study.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet.
It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet.
For example, T2T-ViT with a size comparable to ResNet50 (21.5M parameters) achieves 83.3% top-1 accuracy at image resolution 384×384 on ImageNet.
(Figure: the overall architecture of T2T-ViT)
Weaknesses of ViT: ViT requires pre-training on a large-scale dataset.
The simple tokenization process in ViT cannot capture important local structures in an input image well. Local structures such as edges or lines often span several neighboring patches rather than one; however, the tokenization in ViT simply divides an image into non-overlapping patches and independently converts each into an embedding (see the patch-embedding sketch after this list).
The Transformer architecture used in the original ViT was not well designed or optimized, leading to redundancies in the feature maps.
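For contrast with the T2T module described next, ViT's original tokenization can be written as a strided projection over non-overlapping patches, which is why local structure that crosses a patch border never ends up inside a single token. A minimal PyTorch sketch; the image size, patch size, and embedding dimension are the usual ViT-Base values and are used here only for illustration:

```python
import torch
import torch.nn as nn

class ViTPatchEmbedding(nn.Module):
    """ViT-style tokenization sketch: split the image into non-overlapping
    patches and project each patch independently to an embedding.
    (Illustrative sizes, not necessarily the exact paper configuration.)"""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # flattening each non-overlapping patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768): one token per patch

tokens = ViTPatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```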
How T2T-ViT deals with these problems:
1) A tokenization method:
To cope with the first problem, the authors proposed a tokenization method, the Tokens-to-Token (T2T) module, which iteratively aggregates neighboring tokens into one token through the following T2T process:
A sequence of tokens is passed through a self-attention layer to model the relations among tokens. The output of this step is another sequence of the same length as its input.
The output sequence from the previous step is reshaped back into a 2D array of tokens (re-structurization).
The 2D array of tokens is then divided into overlapping windows, and the neighboring tokens in the same window are concatenated into one longer token (soft split). The result of this process is a shorter 1D sequence of higher-dimensional tokens.
The T2T process can be iterated to further refine the representation of the input image (e.g., twice in the T2T module), as sketched below.
(Figure: the Tokens-to-Token process)
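The following PyTorch sketch shows one T2T iteration under stated assumptions: a single nn.MultiheadAttention layer stands in for the lightweight Token Transformer used in the paper, and the window size (3), stride (2), padding (1), and token dimension are illustrative choices rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class T2TStep(nn.Module):
    """One Tokens-to-Token iteration (illustrative sketch).
    nn.MultiheadAttention stands in for the paper's lightweight Token
    Transformer; window/stride/dim values are assumptions for illustration."""
    def __init__(self, dim=64, kernel_size=3, stride=2, padding=1, num_heads=1):
        super().__init__()
        self.k, self.s, self.p = kernel_size, stride, padding
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Soft split: overlapping windows, so neighboring tokens share content.
        self.unfold = nn.Unfold(kernel_size, stride=stride, padding=padding)

    def forward(self, tokens, h, w):
        b, n, d = tokens.shape                            # n == h * w
        # Step 1: self-attention over the sequence (length unchanged).
        tokens, _ = self.attn(tokens, tokens, tokens)     # (b, h*w, d)
        # Step 2: re-structurization -- reshape back to a 2D token map.
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        # Step 3: soft split -- concatenate the tokens inside each overlapping
        # window into one longer token: fewer tokens, higher dimension.
        x = self.unfold(x)                                # (b, d*k*k, n_new)
        new_h = (h + 2 * self.p - self.k) // self.s + 1
        new_w = (w + 2 * self.p - self.k) // self.s + 1
        return x.transpose(1, 2), new_h, new_w            # (b, n_new, d*k*k)

step = T2TStep(dim=64)
tokens, h, w = torch.randn(1, 56 * 56, 64), 56, 56
out, h, w = step(tokens, h, w)
print(out.shape, h, w)  # torch.Size([1, 784, 576]) 28 28
```

Iterating this step reduces the token length (here 3136 to 784) while each surviving token aggregates an overlapping neighborhood, which is how local structure enters the tokens.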
2) Deep-narrow structure Backbone:
The deep-narrow structure, which uses more Transformer layers (deeper) to improve feature richness while reducing the embedding dimension (narrower) to keep the computational cost fixed, gave the best results among the compared architecture designs.
The sequence of tokens generated by the T2T module is prepended with a classification token (yellow in the figure), as in the original ViT, and then fed into the deep-narrow Transformer, named the T2T-ViT backbone, to make the prediction (see the sketch below).
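A minimal sketch of that backbone, assuming standard pre-norm Transformer encoder layers: the depth (14) and embedding dimension (384) follow the deep-narrow idea at roughly T2T-ViT-14 scale, while the token count and input token dimension are placeholder values matching the T2T sketch above:

```python
import torch
import torch.nn as nn

class T2TViTBackbone(nn.Module):
    """Deep-narrow backbone sketch: many Transformer layers, small embedding
    dimension. Depth/width are illustrative, not the paper's exact config."""
    def __init__(self, num_tokens=196, token_dim=576, embed_dim=384,
                 depth=14, num_heads=6, mlp_ratio=3, num_classes=1000):
        super().__init__()
        # Project the final T2T tokens to the (narrow) backbone dimension.
        self.proj = nn.Linear(token_dim, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=int(embed_dim * mlp_ratio),
            batch_first=True, norm_first=True, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # deep
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, token_dim)
        x = self.proj(tokens)                        # (B, N, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend class token
        x = self.encoder(x)
        return self.head(x[:, 0])                    # classify from class token

model = T2TViTBackbone()
logits = model(torch.randn(2, 196, 576))
print(logits.shape)  # torch.Size([2, 1000])
```

The design trade-off is visible in the constructor: increasing `depth` while shrinking `embed_dim` keeps the parameter and MAC budget roughly fixed while stacking more layers of feature transformation.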
References:
https://sertiscorp.medium.com/vision-transformers-a-review-part-ii-a31136cf848d