[MaxViT] MaxViT: Multi-axis Vision Transformer
{ViT}
Paper: https://arxiv.org/abs/2204.01697
Code:
1) Motivation, Objectives and Related Works:
Motivation:
Objectives:
MaxViT: Multi-axis Vision Transformer highlights how far vision transformers have come in recent years. While early vision transformers suffered from quadratic attention complexity with respect to image size, many techniques have since been introduced to apply vision transformers to larger images with linear scaling complexity.
In MaxViT, this is achieved by decomposing an attention block into two parts with local-global interaction:
local attention ("block attention");
global attention ("grid attention").
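The local-global decomposition above boils down to two different ways of grouping tokens before attention. A minimal sketch of the two partitioning schemes (my own illustration, not the authors' code; function names and the numpy formulation are assumptions): block partition groups spatially adjacent tokens into non-overlapping windows, while grid partition groups tokens sampled at a fixed stride across the whole feature map, giving each attention group a sparse global view.

```python
import numpy as np

def block_partition(x, p):
    """Split (H, W, C) into non-overlapping p x p windows
    -> (num_windows, p*p, C). Attention within each window
    is local ("block attention")."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Split (H, W, C) using a fixed g x g grid
    -> (num_groups, g*g, C). Each group holds one token per
    grid cell, i.e. tokens sampled with stride H//g across the
    whole map, so attention within a group is global but sparse
    ("grid attention")."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

# Example on a 4x4 feature map with scalar channels:
x = np.arange(16).reshape(4, 4, 1)
print(block_partition(x, 2)[0, :, 0])  # adjacent tokens: [0 1 4 5]
print(grid_partition(x, 2)[0, :, 0])   # strided tokens:  [0 2 8 10]
```

Because both partitions produce groups of a fixed size (p*p or g*g tokens) regardless of H and W, attention cost grows linearly with image area rather than quadratically.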
It's worth mentioning that MaxViT is a convolutional-transformer hybrid, featuring convolutional layers alongside the attention blocks. It can be used for predictive modeling (incl. classification, object detection, and instance segmentation) as well as generative modeling.
MaxViT highlights a current trend: combining vision transformers and convolutional networks into hybrid architectures.
Introduction:
Related Works:
Contribution:
2) Methodology:
Method 1:
Method 2:
3) Experimental Results:
Dataset:
Metrics:
Experimental Results:
Ablations: