[PVTv2] Improved Baselines with Pyramid Vision Transformer
{High Output Resolution, A Progressive Shrinking Pyramid, Spatial-reduction Attention - SRA}
Figure: CNN vs. ViT vs. the proposed PVT
0) Motivation, Objectives and Related Works:
Motivation:
This work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks.
The recently proposed Vision Transformer (ViT) was designed specifically for image classification.
Objectives:
Introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting the Transformer to various dense prediction tasks.
PVT has several merits:
Unlike ViT, which typically yields low-resolution outputs and incurs high computational and memory costs, PVT can be trained on dense partitions of an image to achieve the high output resolution that dense prediction requires, and it uses a progressive shrinking pyramid to reduce the computation spent on large feature maps (a sketch of the spatial-reduction attention that enables this follows this list).
PVT inherits the advantages of both CNNs and Transformers, making it a unified, convolution-free backbone for various vision tasks that can serve as a direct replacement for CNN backbones.
PVT boosts the performance of many downstream tasks, including object detection, instance segmentation, and semantic segmentation.
PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2).
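To make the cost reduction concrete, below is a minimal PyTorch sketch of Spatial-Reduction Attention (SRA), the attention variant PVT uses inside each pyramid stage: keys and values are computed on a copy of the feature map shrunk by a reduction ratio, so attending over a large high-resolution token grid becomes affordable. This is a sketch based on the paper's description; the class name, parameter defaults, and demo shapes are illustrative, not the authors' reference code.

import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    # Multi-head self-attention in which keys/values come from a
    # spatially reduced copy of the input (reduction ratio sr_ratio).
    def __init__(self, dim, num_heads=1, sr_ratio=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Strided conv shrinks the K/V token grid by sr_ratio per side,
            # i.e. sr_ratio**2 fewer key/value tokens.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W.
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, N/R^2, C)
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)               # each (B, heads, M, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, M), M << N
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Stage-1-like setting: a 56x56 token grid (H/4 x W/4 of a 224x224 image), dim 64.
x = torch.randn(2, 56 * 56, 64)
sra = SpatialReductionAttention(dim=64, num_heads=1, sr_ratio=8)
print(sra(x, 56, 56).shape)  # torch.Size([2, 3136, 64])

With sr_ratio=8 at the highest-resolution stage, each head's attention matrix shrinks from 3136x3136 to 3136x49, which is what lets PVT keep the fine-grained feature maps that ViT's columnar design cannot afford.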
References:
https://sh-tsang.medium.com/review-pvtv2-improved-baselines-with-pyramid-vision-transformer-5cfd354a53bd