[PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
{High Output Resolution, A Progressive Shrinking Pyramid, Spatial-reduction Attention - SRA}
CNN vs ViT vs Proposed PVT
0) Motivation, Objectives, and Related Works:
Motivation:
This work investigates a simpler, convolution-free backbone network useful for many dense prediction tasks.
The recently proposed Vision Transformer (ViT) was designed specifically for image classification.
Objectives:
Introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting the Transformer to various dense prediction tasks.
PVT has several merits:
Unlike ViT, which typically yields low-resolution outputs and incurs high computational and memory costs, PVT can be trained on dense partitions of an image to achieve the high output resolution that is important for dense prediction, and it uses a progressive shrinking pyramid to reduce the computation on large feature maps.
PVT inherits the advantages of both CNNs and Transformers, making it a unified, convolution-free backbone for various vision tasks that can be used as a direct replacement for CNN backbones.
PVT boosts the performance of many downstream tasks, including object detection, instance segmentation, and semantic segmentation.
PVT+RetinaNet achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see Figure 2).
Pyramid Vision Transformer (PVT)
Overall Architecture:
The entire model is divided into four stages, each of which comprises a patch embedding layer and an Li-layer Transformer encoder. Following a pyramid structure, the output resolution of the four stages progressively shrinks from high (stride 4) to low (stride 32).
In the first stage, given an input image of size H×W×3, it is first divided into HW/4² patches, each of size 4×4×3.
Then, the flattened patches are fed to a linear projection, yielding embedded patches of size HW/4²×C1.
After that, the embedded patches, together with a position embedding, are passed through a Transformer encoder with L1 layers, and the output is reshaped to a feature map F1 of size H/4×W/4×C1.
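To make this concrete, here is a minimal PyTorch sketch of stage 1 (an illustration, not the official implementation); the choices of a 224×224 input, C1 = 64, and L1 = 2 are illustrative assumptions, and a standard Transformer encoder is used here in place of the SRA-based encoder described later.
```python
import torch
import torch.nn as nn

# Minimal sketch of PVT stage 1 (illustrative, not the official code).
# Assumptions: 224x224 input, C1 = 64 embedding channels, L1 = 2 encoder layers.
H, W, C1, L1 = 224, 224, 64, 2
x = torch.randn(1, 3, H, W)                        # input image, B x 3 x H x W

# Patch embedding: split into 4x4x3 patches and linearly project each to C1.
patches = x.unfold(2, 4, 4).unfold(3, 4, 4)        # B x 3 x H/4 x W/4 x 4 x 4
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, (H // 4) * (W // 4), 3 * 4 * 4)
tokens = nn.Linear(3 * 4 * 4, C1)(patches)         # B x (HW/4^2) x C1

# Add a learnable position embedding, then run an L1-layer Transformer encoder
# (a standard encoder here; PVT replaces its attention with SRA, see below).
pos = nn.Parameter(torch.zeros(1, (H // 4) * (W // 4), C1))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=C1, nhead=1, batch_first=True),
    num_layers=L1,
)
tokens = encoder(tokens + pos)

# Reshape the output back to a feature map F1 of size H/4 x W/4 x C1.
F1 = tokens.reshape(1, H // 4, W // 4, C1)
print(F1.shape)                                    # torch.Size([1, 56, 56, 64])
```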
Feature Pyramid for Transformer:
In the same way, using the feature map from the previous stage as input, the following feature maps are obtained: F2, F3, and F4, whose strides are 8, 16, and 32 pixels with respect to the input image.
Thus, PVT uses a progressive shrinking strategy to control the scale of the feature maps via the patch embedding layers, and the feature pyramid {F1, F2, F3, F4} is obtained.
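The shape calculation below sketches how the pyramid relates to the input; the channel widths C1..C4 = 64, 128, 320, 512 follow the paper's default configuration, while the 224×224 input and the loop itself are just illustrative.
```python
# Sketch of the feature-pyramid shapes produced by the four PVT stages.
# Overall strides are 4, 8, 16, 32 with respect to the input image.
H, W = 224, 224
channels = [64, 128, 320, 512]        # C1..C4 (default PVT widths)
strides = [4, 8, 16, 32]

for i, (s, c) in enumerate(zip(strides, channels), start=1):
    print(f"F{i}: {H // s} x {W // s} x {c}  (stride {s})")
# F1: 56 x 56 x 64   (stride 4)
# F2: 28 x 28 x 128  (stride 8)
# F3: 14 x 14 x 320  (stride 16)
# F4: 7 x 7 x 512    (stride 32)
```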
Spatial-reduction attention (SRA):
Spatial-reduction attention or SRA was proposed to speed up the computation of Pyramid Vision Transformer (PVT).
SRA reduces the spatial dimension (i.e., the number of tokens) of the key (K) and value (V) matrices by a factor of Ri², where i indicates the stage of the Transformer model.
The spatial reduction consists of two steps:
Concatenating the neighboring tokens (each of dimension Ci) in every non-overlapping Ri×Ri window into a single token of dimension Ri²Ci.
Linearly projecting each concatenated token back to dimension Ci and applying layer normalization.
The time and space complexities of attention decrease because the spatial reduction lowers the number of tokens, as sketched below.
Comparison between the regular attention (left) and SRA (right)
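Here is a minimal PyTorch sketch of these two steps on a flattened token sequence; the stage-1 sizes (56×56 tokens, Ci = 64) and the reduction ratio Ri = 8 are illustrative choices.
```python
import torch
import torch.nn as nn

# Sketch of the two-step spatial reduction SR(x) applied to the K/V tokens.
# Illustrative settings: a 56x56 stage-1 token grid, Ci = 64, Ri = 8.
Hi, Wi, Ci, Ri = 56, 56, 64, 8
x = torch.randn(1, Hi * Wi, Ci)                    # B x (Hi*Wi) x Ci

# Step 1: concatenate the tokens in each non-overlapping Ri x Ri window
# into one token of dimension Ri^2 * Ci.
x = x.reshape(1, Hi // Ri, Ri, Wi // Ri, Ri, Ci)
x = x.permute(0, 1, 3, 2, 4, 5).reshape(1, (Hi // Ri) * (Wi // Ri), Ri * Ri * Ci)

# Step 2: linearly project each concatenated token back to dimension Ci
# and apply layer normalization.
sr_proj = nn.Linear(Ri * Ri * Ci, Ci)
norm = nn.LayerNorm(Ci)
x_reduced = norm(sr_proj(x))

print(x_reduced.shape)   # torch.Size([1, 49, 64]) -- 3136 tokens reduced to 49
```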
Transformer Encoder:
Since PVT needs to process high-resolution (e.g., 4-stride) feature maps, a spatial-reduction attention (SRA) layer is proposed to replace the traditional multi-head attention (MHA) layer in the encoder.
Similar to MHA, the proposed SRA receives a query Q, a key K, and a value V as input, and outputs a refined feature.
The difference is that the proposed SRA reduces the spatial scale of K and V before the attention operation, which largely reduces the computational/memory overhead.
Formally, SRA(Q, K, V) = Concat(head0, …, headNi)·W^O with headj = Attention(Q·Wj^Q, SR(K)·Wj^K, SR(V)·Wj^V), where SR(·) is the operation for reducing the spatial dimension of the input sequence (i.e., K or V), written as SR(x) = Norm(Reshape(x, Ri)·W^S).
Here, W^S is the linear projection that reduces the dimension of the concatenated input sequence to Ci, and Reshape(x, Ri) is an operation that reshapes an input sequence x of size (Hi·Wi)×Ci to a sequence of size (Hi·Wi/Ri²)×(Ri²·Ci).
And the attention operation Attention(·) is calculated as in the original Transformer: Attention(q, k, v) = Softmax(q·kᵀ/√dhead)·v.
The computational/memory costs of the attention operation are thus Ri² times lower than those of MHA, so the proposed SRA can handle larger input feature maps/sequences with limited resources.
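Putting the pieces together, below is a compact single-head sketch of SRA (an illustration of the idea, not the official multi-head implementation); the strided convolution used for the spatial reduction is equivalent to the reshape-and-project formulation above, and the class/parameter names are assumptions.
```python
import torch
import torch.nn as nn

class SRASketch(nn.Module):
    """Single-head sketch of spatial-reduction attention (illustrative only)."""
    def __init__(self, dim, sr_ratio):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # SR(x): reduce the token count by sr_ratio^2, then normalize.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                       # N = H * W tokens
        q = self.q(x)                           # queries keep full resolution
        if self.sr_ratio > 1:
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)                  # N / sr_ratio^2 reduced tokens
        else:
            x_ = x
        k, v = self.kv(x_).chunk(2, dim=-1)     # keys/values from reduced tokens
        attn = (q @ k.transpose(-2, -1)) * (C ** -0.5)
        out = attn.softmax(dim=-1) @ v          # attention map: N x (N / sr_ratio^2)
        return self.proj(out)

x = torch.randn(1, 56 * 56, 64)
print(SRASketch(dim=64, sr_ratio=8)(x, 56, 56).shape)   # torch.Size([1, 3136, 64])
```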
PVT Variants:
Four variants of different scales are designed: PVT-Tiny, PVT-Small, PVT-Medium, and PVT-Large, whose parameter counts are roughly comparable to those of ResNet18, ResNet50, ResNet101, and ResNet152, respectively.
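For reference, the per-stage encoder depths reported in the paper for the four variants can be summarized as follows (the dictionary layout itself is just illustrative):
```python
# Per-stage encoder depths (L1..L4) of the four PVT variants, as reported
# in the paper; all variants share the channel widths C1..C4 = 64, 128, 320, 512.
pvt_depths = {
    "PVT-Tiny":   [2, 2, 2, 2],
    "PVT-Small":  [3, 4, 6, 3],
    "PVT-Medium": [3, 4, 18, 3],
    "PVT-Large":  [3, 8, 27, 3],
}
```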
References:
https://sh-tsang.medium.com/review-pyramid-vision-transformer-a-versatile-backbone-for-dense-prediction-without-convolutions-bafc9dc83149