[SSA] S. Ren, D. Zhou, S. He, J. Feng, and X. Wang, “Shunted Self-Attention via Multi-Scale Token Aggregation”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10853-10862, 2022. (multi-scale tokens: adopts two different scale ratios to downsample the base tokens; see the sketch at the end of this list).
[QuadTree] S. Tang, J. Zhang, S. Zhu, and P. Tan, “QuadTree Attention for Vision Transformers”, arXiv preprint arXiv:2201.02767, 2022.
"GLiT: Neural Architecture Search for Global and Local Image Transformer"
"DeepNet: Scaling Transformers to 1,000 Layers"
[Conviformers] M. Vaishnav, T. Fel, I. F. Rodríguez, and T. Serre, "Conviformers: Convolutionally guided Vision Transformer", arXiv preprint arXiv:2208.08900, 2022. [Fast Read]
Hydra Attention: Efficient Attention with Many Heads
[NLNet] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[LG-Transformer] Local-to-Global Self-Attention in Vision Transformers, 2021. [Paper]
[MetaFormer] [PoolFormer] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, "MetaFormer Is Actually What You Need for Vision", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10819-10829, 2022.
[TRT] H. Su, Y. Ye, Z. Chen, M. Song, and L. Cheng, "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv preprint arXiv:2208.01838, 2022. [Code]
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[LXMERT] LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP, 2019.
UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.
Unified vision-language pretraining for image captioning and VQA. AAAI, 2020.
Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[CLIP] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision", Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8748-8763, 2021. [Code]
LViT: Language meets Vision Transformer in Medical Image Segmentation
ResT: An Efficient Transformer for Visual Recognition
Training vision transformers for image retrieval
[TNT] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, Y. Wang, "Transformer in Transformer", Advances in Neural Information Processing Systems 34 (NeurIPS), 2021.
"Transreid: Transformer-based object reidentification"
Token labeling: Training an 85.5% top-1 accuracy vision transformer with 56M parameters on ImageNet
Training data-efficient image transformers & distillation through attention.
Going deeper with image transformers
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
CvT: Introducing convolutions to vision transformers.
Co-scale conv-attentional image transformers.
Incorporating convolution designs into visual transformers.
[66, 19, 84, 56, 45, 55, 75 - SwinIR]
Paper 14: S. Mehta and M. Rastegari, “MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer”, in International Conference on Learning Representations (ICLR), 2022.
Paper 15: A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, “Scaling Local Self-Attention for Parameter Efficient Visual Backbones”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12894-12904.
Paper 16: Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, “Mobile-Former: Bridging MobileNet and Transformer”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5270-5279.
Paper 17: A. Hassani, S. Walton, J. Li, and H. Shi, “Neighborhood Attention Transformer”, 2022. [Online]. Available: arXiv:2204.07143
Paper 18: Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo, “Swin Transformer V2: Scaling Up Capacity and Resolution”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12009-12019.
Paper 19: Y. Li, S. Xie, X. Chen, P. Dollár, K. He, and R. Girshick, “Benchmarking Detection Transfer Learning with Vision Transformers”, arXiv preprint (CoRR), 2021.
Paper 20: J. Meng, Z. Tan, Y. Yu, P. Wang, and S. Liu, “TL-med: A two-stage transfer learning recognition model for medical images of COVID-19”, Biocybernetics and Biomedical Engineering, 2022.
Paper 21: H. E. Kim, A. Cosa-Linan, N. Santhanam, M. Jannesari, M. E. Maros, and T. Ganslandt, “Transfer learning for medical image classification: a literature review”, BMC Medical Imaging, vol. 22, article 69, 2022.
Paper 22: H. Touvron, M. Cord, A. El-Nouby, J. Verbeek, and H. Jégou, “Three things everyone should know about vision transformers”, arXiv preprint arXiv:2203.09795, 2022.
Paper 23: A. Hassani and H. Shi, “Dilated Neighborhood Attention Transformer”, arXiv preprint arXiv:2209.15001, 2022. (PPT)
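Below is a minimal PyTorch sketch of the multi-scale token aggregation idea noted for [SSA] at the top of this list: queries stay at full resolution while half of the heads compute keys/values from tokens pooled at one ratio and the other half at a coarser ratio. This is an illustration under stated assumptions, not the authors' implementation: the module name `MultiScaleKVAttention`, the ratios (2, 4), and the use of average pooling in place of the paper's strided-convolution token aggregation (and its local-enhancement branch) are assumptions made here.

```python
# Minimal sketch (not the SSA authors' code) of multi-scale key/value
# downsampling: head groups attend to token sets pooled at different ratios.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleKVAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4, ratios=(2, 4)):
        super().__init__()
        assert num_heads % len(ratios) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.ratios = ratios
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        heads_per_scale = self.num_heads // len(self.ratios)
        outs = []
        for i, r in enumerate(self.ratios):
            # Downsample the base tokens spatially by ratio r before K, V.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = F.avg_pool2d(feat, kernel_size=r, stride=r)
            feat = feat.flatten(2).transpose(1, 2)              # (B, N_r, C)
            kv = self.kv(feat).reshape(B, -1, 2, self.num_heads, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)                    # (B, heads, N_r, hd)
            hs = slice(i * heads_per_scale, (i + 1) * heads_per_scale)
            attn = (q[:, hs] @ k[:, hs].transpose(-2, -1)) * self.head_dim ** -0.5
            outs.append(attn.softmax(dim=-1) @ v[:, hs])
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Quick shape check on a 14x14 token grid with embedding dim 64.
x = torch.randn(2, 14 * 14, 64)
print(MultiScaleKVAttention()(x, H=14, W=14).shape)  # torch.Size([2, 196, 64])
```

Splitting the heads across the two pooling ratios lets one attention layer mix fine and coarse context while keeping the key/value cost close to that of the downsampled token sets.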