SegFormer
{Efficient (sequence-reduced) self-attention, hierarchical Transformer encoder (MiT), Mix-FFN, All-MLP decoder}
Paper: https://arxiv.org/abs/2105.15203
Code: https://github.com/NVlabs/SegFormer
1) Motivation, Objectives and Related Works:
Motivation: ViT-based segmentation (e.g., SETR) produces single-scale, low-resolution features, is computationally heavy on large images, and relies on positional encoding that must be interpolated when the test resolution differs from training, which degrades accuracy.
Objectives: A simple, efficient, yet powerful semantic segmentation framework: a hierarchical, positional-encoding-free Transformer encoder paired with a very lightweight decoder.
Related Works: CNN-based segmentation (FCN, DeepLab family), Transformer backbones for dense prediction (ViT/SETR, PVT), and decoders built from heavy hand-crafted modules.
Contribution: (1) a positional-encoding-free, hierarchical Transformer encoder (MiT); (2) a lightweight All-MLP decoder that aggregates multi-scale features; (3) strong accuracy, efficiency, and robustness on ADE20K, Cityscapes, and Cityscapes-C.
2) Methodology:
Architecture Overview:
Input Size: HxWx3
Patch Size: 4x4 (small patches keep fine-grained features for dense prediction; ViT uses 16x16)
Architecture: hierarchical Transformer
Encoder: MixTransformer (MiT) ==> multi-scale features (1/4, 1/8, 1/16, 1/32)
Decoder: All-MLP decoder ==> segmentation map of size (H/4, W/4, N_cls), then 4x up-sampling to (H, W, N_cls).
Encoder: 6 variants, MiT-B0 to MiT-B5 (B0 is the smallest and fastest, B5 the largest and most accurate).
Hierarchical Feature Representation:
Consists of 4 stages (blocks).
Output of stage i: H/2^(i+1) x W/2^(i+1) x C_i, with C_(i+1) > C_i, i ∈ {1, 2, 3, 4}
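As a quick sanity check, a minimal Python snippet (not from the paper's code) that prints these stage output sizes for an example 512x512 input; the channel widths used here are the MiT-B0 values:

# Sketch: stage output shapes of the hierarchical encoder for a 512x512 input.
H = W = 512
channels = [32, 64, 160, 256]  # C1..C4 for MiT-B0 (larger variants use wider channels)

for i, C in enumerate(channels, start=1):
    h, w = H // 2 ** (i + 1), W // 2 ** (i + 1)
    print(f"stage {i}: {h} x {w} x {C}")
# stage 1: 128 x 128 x 32   (1/4 resolution)
# stage 2: 64 x 64 x 64     (1/8)
# stage 3: 32 x 32 x 160    (1/16)
# stage 4: 16 x 16 x 256    (1/32)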
Overlapped Patch Merging:
Preserves local continuity between neighboring patches (unlike ViT's non-overlapping patches).
Parameters: K - patch size; S - stride; P - padding
K = 7, S = 4, P = 3 for Block 1, and K = 3, S = 2, P = 1 for Block 2, 3, 4.
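A minimal PyTorch sketch of overlapped patch merging as a strided convolution (illustrative class/variable names, not the official implementation; the stage-1 embedding width of 32 is the MiT-B0 value):

import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapped patch merging implemented as a strided convolution (sketch)."""
    def __init__(self, in_ch, embed_dim, patch_size, stride, padding):
        super().__init__()
        # K > S makes neighboring patches overlap, preserving local continuity.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/S, W/S)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence, N = H*W
        return self.norm(x), H, W

# Stage 1 uses K=7, S=4, P=3; stages 2-4 use K=3, S=2, P=1.
stage1 = OverlapPatchEmbed(3, 32, patch_size=7, stride=4, padding=3)
tokens, H, W = stage1(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 16384, 32]) -> 128*128 tokens at 1/4 resolution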
Efficient Self-attention:
Multi-head self-attention.
Each head has Q, K, V of the same size (N, C), where N = H x W is the sequence length.
Standard self-attention, with complexity O(N^2): Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_head)) V
To make it more efficient, a reduction ratio R shrinks the sequence length of K (and V): K_hat = Reshape(N/R, C*R)(K), then K = Linear(C*R, C)(K_hat), giving K of size (N/R, C).
The complexity is reduced to O(N^2 / R).
In practice, R = 64, 16, 4, 1 on 4 blocks.
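A minimal single-head PyTorch sketch of the sequence-reduction idea above (illustrative names; the real model is multi-head, and the released code reportedly realizes the reduction with a strided convolution, so this is only meant to show where the O(N^2 / R) cost comes from):

import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention with sequence reduction on K and V (single head for clarity)."""
    def __init__(self, dim, reduction_ratio):
        super().__init__()
        self.R = reduction_ratio
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # K_hat = Reshape(N/R, C*R)(K); K = Linear(C*R, C)(K_hat)
        self.sr = nn.Linear(dim * reduction_ratio, dim)

    def forward(self, x):                              # x: (B, N, C)
        B, N, C = x.shape
        q = self.q(x)                                  # (B, N, C)
        x_r = x.reshape(B, N // self.R, C * self.R)    # shorten the sequence by R
        x_r = self.sr(x_r)                             # (B, N/R, C)
        k, v = self.kv(x_r).chunk(2, dim=-1)           # each (B, N/R, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, N/R): cost O(N^2 / R)
        return attn.softmax(dim=-1) @ v                # (B, N, C)

attn = EfficientSelfAttention(dim=32, reduction_ratio=64)   # stage-1 setting: R = 64
y = attn(torch.randn(1, 128 * 128, 32))
print(y.shape)  # torch.Size([1, 16384, 32])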
Mix-FFN:
Motivation: positional encoding (PE) is problematic when the test resolution differs from training, because the PE must be interpolated to the new size, which hurts accuracy.
Objective: Mix-FFN mixes a 3x3 convolution into the Feed-Forward Network, removing the need for positional encoding.
Formula: x_out = MLP(GELU(Conv_3x3(MLP(x_in)))) + x_in, where x_in is the output of the self-attention layer.
In practice, the 3x3 convolution is implemented as a depth-wise convolution to save parameters and FLOPs.
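A minimal PyTorch sketch of Mix-FFN (illustrative names; a channel width of 32 and an expansion ratio of 4 are assumed here purely for the example):

import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Mix-FFN: MLP -> 3x3 depth-wise conv -> GELU -> MLP, with a residual connection."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # The zero-padded depth-wise 3x3 conv leaks location information,
        # acting as an implicit positional cue, so no explicit PE is needed.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):                        # x: (B, N, C), N = H*W
        res = x
        x = self.fc1(x)                                # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)      # back to a 2D map for the conv
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)               # (B, N, hidden)
        x = self.fc2(self.act(x))
        return x + res                                 # residual: + x_in

ffn = MixFFN(dim=32, hidden_dim=32 * 4)
y = ffn(torch.randn(1, 128 * 128, 32), H=128, W=128)
print(y.shape)  # torch.Size([1, 16384, 32])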
Decoder: Lightweight All-MLP Decoder
Every decoder component is just an MLP (a linear layer, equivalent to a 1x1 convolution).
Steps:
The features from the 4 stages are each passed through an MLP so they share the same channel dimension C.
Then, they are up-sampled to (H/4, W/4) and concatenated.
An MLP fuses the concatenated features and reduces the channel dimension from 4C back to C (e.g., 256).
Lastly, another MLP takes the fused C-dim feature map and outputs the segmentation map (H/4, W/4, N_cls), which is then up-sampled 4x to (H, W, N_cls).
Formula (a code sketch follows below):
F_hat_i = Linear(C_i, C)(F_i), for all i
F_hat_i = Upsample(H/4 x W/4)(F_hat_i), for all i
F = Linear(4C, C)(Concat(F_hat_i))
M = Linear(C, N_cls)(F), where M is the predicted mask of size (H/4, W/4, N_cls)
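A minimal PyTorch sketch of the All-MLP decoder following the four steps above (assumed MiT-B0-like input channels, a decoder width C = 256, and 19 classes as in Cityscapes, purely for illustration; details such as normalization and dropout in the official head are omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """Lightweight All-MLP decoder head (sketch)."""
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        # Step 1: unify the channel dimension of every stage (1x1 conv == per-pixel MLP).
        self.linear_c = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1)
                                       for c in in_channels])
        # Step 3: fuse the concatenated features 4C -> C.
        self.fuse = nn.Conv2d(4 * embed_dim, embed_dim, kernel_size=1)
        # Step 4: predict per-pixel class logits.
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, feats):                      # feats: (B, Ci, H/2^(i+1), W/2^(i+1)) per stage
        size = feats[0].shape[-2:]                 # (H/4, W/4)
        outs = []
        for f, proj in zip(feats, self.linear_c):
            f = proj(f)                            # step 1
            # Step 2: up-sample everything to (H/4, W/4).
            outs.append(F.interpolate(f, size=size, mode='bilinear', align_corners=False))
        x = self.fuse(torch.cat(outs, dim=1))      # step 3
        x = self.classifier(x)                     # (B, N_cls, H/4, W/4)
        # Final 4x up-sampling to full resolution.
        return F.interpolate(x, scale_factor=4, mode='bilinear', align_corners=False)

decoder = AllMLPDecoder()
feats = [torch.randn(1, c, 128 // 2 ** i, 128 // 2 ** i)
         for i, c in enumerate((32, 64, 160, 256))]
print(decoder(feats).shape)  # torch.Size([1, 19, 512, 512])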
3) Experimental Results:
Experimental Results: SegFormer outperforms prior methods (e.g., SETR, DeepLabV3+) on ADE20K and Cityscapes while using far fewer parameters and FLOPs, and shows much stronger zero-shot robustness on the corrupted Cityscapes-C benchmark.
Ablations:
This strategy can only produce larger tokens; it cannot produce tokens smaller than the base tokens (as noted in the MST paper).
References: