[SSA] S. Ren, D. Zhou, S. He, J. Feng, and X. Wang, “Shunted Self-Attention via Multi-Scale Token Aggregation”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10853-10862, 2022. (multi-scale tokens: adopts two different scale ratios to downsample the base tokens; see the sketch at the end of this list).
[QuadTree] S. Tang, J. Zhang, S. Zhu, and P. Tan, “QuadTree Attention for Vision Transformers”, arXiv preprint arXiv:2201.02767, 2022.
"GLiT: Neural Architecture Search for Global and Local Image Transformer"
"DeepNet: Scaling Transformers to 1,000 Layers"
[Conviformers] M. Vaishnav, T. Fel, I. F. Rodríguez, and T. Serre, "Conviformers: Convolutionally guided Vision Transformer", arXiv preprint arXiv:2208.08900, 2022. [Fast Read]
Hydra Attention: Efficient Attention with Many Heads
[NLNet] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[LG-Transformer] Local-to-Global Self-Attention in Vision Transformers, 2021. [Paper]
[MetaFormer] [PoolFormer] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, "MetaFormer Is Actually What You Need for Vision", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10819-10829, 2022.
[TRT] H. Su, Y. Ye, Z. Chen, M. Song, and L. Cheng, "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv preprint arXiv:2208.01838, 2022. [Code]
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[LXMERT] LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP, 2019.
UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066, 2019.
Unified vision-language pretraining for image captioning and VQA. AAAI, 2020.
Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[CLIP] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning Transferable Visual Models From Natural Language Supervision", Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8748-8763, 2021. [Code]
LViT: Language meets Vision Transformer in Medical Image Segmentation
ResT: An Efficient Transformer for Visual Recognition
Training vision transformers for image retrieval
[TNT] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, Y. Wang, "Transformer in Transformer", Advances in Neural Information Processing Systems 34 (NeurIPS), 2021.
"Transreid: Transformer-based object reidentification"
Token labeling: Training an 85.5% top-1 accuracy vision transformer with 56M parameters on ImageNet
Training data-efficient image transformers & distillation through attention.
Going deeper with image transformers
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
CvT: Introducing convolutions to vision transformers.
Co-scale conv-attentional image transformers.
Incorporating convolution designs into visual transformers.
[66, 19, 84, 56, 45, 55, 75 - SwinIR]
Paper 14: S. Mehta and M. Rastegari, “MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer”, in International Conference on Learning Representations (ICLR), 2022.
Paper 15: A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, “Scaling Local Self-Attention for Parameter Efficient Visual Backbones”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12894-12904.
Paper 16: Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu, “Mobile-Former: Bridging MobileNet and Transformer”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5270-5279.
Paper 17: A. Hassani, S. Walton, J. Li, and H. Shi, “Neighborhood Attention Transformer”, 2022. [Online]. Available: arXiv:2204.07143
Paper 18: Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, and B. Guo, “Swin Transformer V2: Scaling Up Capacity and Resolution”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12009-12019.
Paper 19: Y. Li, S. Xie, X. Chen, P. Dollár, K. He, and R. Girshick, “Benchmarking Detection Transfer Learning with Vision Transformers”, arXiv preprint (CoRR), 2021.
Paper 20: J. Meng, Z. Tan, Y. Yu, P. Wang, and S. Liu, “TL-med: A two-stage transfer learning recognition model for medical images of COVID-19”, Biocybernetics and Biomedical Engineering, 2022.
Paper 21: H. E. Kim, A. Cosa-Linan, N. Santhanam, M. Jannesari, M. E. Maros, and T. Ganslandt, “Transfer learning for medical image classification: a literature review”, BMC Medical Imaging, vol. 22, article 69, 2022.
Paper 22: H. Touvron, M. Cord, A. El-Nouby, J. Verbeek, and H. Jégou, “Three things everyone should know about vision transformers”, arXiv preprint arXiv:2203.09795, 2022.
Paper 23: A. Hassani and H. Shi, “Dilated Neighborhood Attention Transformer”, arXiv preprint arXiv:2209.15001, 2022. (PPT)
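Below is a minimal PyTorch sketch of the multi-scale token aggregation idea noted for [SSA] at the top of this list: queries stay at full resolution while half of the heads compute keys/values from tokens pooled at one ratio and the other half at a coarser ratio. This is an illustration under stated assumptions, not the authors' implementation: the module name `MultiScaleKVAttention`, the ratios (2, 4), and the use of average pooling in place of the paper's strided-convolution token aggregation (and its local-enhancement branch) are assumptions made here.

```python
# Minimal sketch (not the SSA authors' code) of multi-scale key/value
# downsampling: head groups attend to token sets pooled at different ratios.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleKVAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4, ratios=(2, 4)):
        super().__init__()
        assert num_heads % len(ratios) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.ratios = ratios
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        heads_per_scale = self.num_heads // len(self.ratios)
        outs = []
        for i, r in enumerate(self.ratios):
            # Downsample the base tokens spatially by ratio r before K, V.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = F.avg_pool2d(feat, kernel_size=r, stride=r)
            feat = feat.flatten(2).transpose(1, 2)              # (B, N_r, C)
            kv = self.kv(feat).reshape(B, -1, 2, self.num_heads, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)                    # (B, heads, N_r, hd)
            hs = slice(i * heads_per_scale, (i + 1) * heads_per_scale)
            attn = (q[:, hs] @ k[:, hs].transpose(-2, -1)) * self.head_dim ** -0.5
            outs.append(attn.softmax(dim=-1) @ v[:, hs])
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Quick shape check on a 14x14 token grid with embedding dim 64.
x = torch.randn(2, 14 * 14, 64)
print(MultiScaleKVAttention()(x, H=14, W=14).shape)  # torch.Size([2, 196, 64])
```

Splitting the heads across the two pooling ratios lets one attention layer mix fine and coarse context while keeping the key/value cost close to that of the downsampled token sets.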