SegFormer
{Efficient (sequence-reduced) self-attention, hierarchical Transformer encoder (MiT), Mix-FFN, All-MLP decoder}
Paper: https://arxiv.org/abs/2105.15203
Code: https://github.com/NVlabs/SegFormer
1) Motivation, Objectives and Related Works:
Motivation: ViT-based segmentation (e.g., SETR) produces single-scale, low-resolution features, is computationally heavy on large images, and relies on positional encoding that must be interpolated when the test resolution differs from training, which degrades accuracy.
Objectives: A simple, efficient, yet powerful semantic segmentation framework: a hierarchical, positional-encoding-free Transformer encoder paired with a very lightweight decoder.
Related Works: CNN-based segmentation (FCN, DeepLab family), Transformer backbones for dense prediction (ViT/SETR, PVT), and decoders built from heavy hand-crafted modules.
Contribution: (1) a positional-encoding-free, hierarchical Transformer encoder (MiT); (2) a lightweight All-MLP decoder that aggregates multi-scale features; (3) strong accuracy, efficiency, and robustness on ADE20K, Cityscapes, and Cityscapes-C.
2) Methodology:
Architecture Overview:
Input Size: HxWx3
Patch Size: 4x4 (small patches keep fine-grained features for dense prediction; ViT uses 16x16)
Architecture: hierarchical Transformer
Encoder: MixTransformer (MiT) ==> multi-scale features (1/4, 1/8, 1/16, 1/32)
Decoder: All-MLP decoder ==> segmentation map of size (H/4, W/4, N_cls), then 4x up-sampling to (H, W, N_cls).
Encoder: 6 variants, MiT-B0 to MiT-B5 (B0 is the smallest and fastest, B5 the largest and most accurate).
Hierarchical Feature Representation:
Consists of 4 stages (blocks).
Output of stage i: H/2^(i+1) x W/2^(i+1) x C_i, with C_(i+1) > C_i, i ∈ {1, 2, 3, 4}
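As a quick sanity check, a minimal Python snippet (not from the paper's code) that prints these stage output sizes for an example 512x512 input; the channel widths used here are the MiT-B0 values:

# Sketch: stage output shapes of the hierarchical encoder for a 512x512 input.
H = W = 512
channels = [32, 64, 160, 256]  # C1..C4 for MiT-B0 (larger variants use wider channels)

for i, C in enumerate(channels, start=1):
    h, w = H // 2 ** (i + 1), W // 2 ** (i + 1)
    print(f"stage {i}: {h} x {w} x {C}")
# stage 1: 128 x 128 x 32   (1/4 resolution)
# stage 2: 64 x 64 x 64     (1/8)
# stage 3: 32 x 32 x 160    (1/16)
# stage 4: 16 x 16 x 256    (1/32)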
Overlapped Patch Merging:
Preserves local continuity between neighboring patches (unlike ViT's non-overlapping patches).
Parameters: K - patch size; S - stride; P - padding
K = 7, S = 4, P = 3 for Block 1, and K = 3, S = 2, P = 1 for Block 2, 3, 4.
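A minimal PyTorch sketch of overlapped patch merging as a strided convolution (illustrative class/variable names, not the official implementation; the stage-1 embedding width of 32 is the MiT-B0 value):

import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapped patch merging implemented as a strided convolution (sketch)."""
    def __init__(self, in_ch, embed_dim, patch_size, stride, padding):
        super().__init__()
        # K > S makes neighboring patches overlap, preserving local continuity.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/S, W/S)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence, N = H*W
        return self.norm(x), H, W

# Stage 1 uses K=7, S=4, P=3; stages 2-4 use K=3, S=2, P=1.
stage1 = OverlapPatchEmbed(3, 32, patch_size=7, stride=4, padding=3)
tokens, H, W = stage1(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 16384, 32]) -> 128*128 tokens at 1/4 resolution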
Efficient Self-attention:
Multi-head self-attention.
Each head has Q, K, V of the same size (N, C), where N = H x W is the sequence length.
Standard self-attention, with complexity O(N^2): Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_head)) V
To make it more efficient, a reduction ratio R shrinks the sequence length of K (and V): K_hat = Reshape(N/R, C*R)(K), then K = Linear(C*R, C)(K_hat), giving K of size (N/R, C).
The complexity is reduced to O(N^2 / R).
In practice, R = 64, 16, 4, 1 on 4 blocks.
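A minimal single-head PyTorch sketch of the sequence-reduction idea above (illustrative names; the real model is multi-head, and the released code reportedly realizes the reduction with a strided convolution, so this is only meant to show where the O(N^2 / R) cost comes from):

import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention with sequence reduction on K and V (single head for clarity)."""
    def __init__(self, dim, reduction_ratio):
        super().__init__()
        self.R = reduction_ratio
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # K_hat = Reshape(N/R, C*R)(K); K = Linear(C*R, C)(K_hat)
        self.sr = nn.Linear(dim * reduction_ratio, dim)

    def forward(self, x):                              # x: (B, N, C)
        B, N, C = x.shape
        q = self.q(x)                                  # (B, N, C)
        x_r = x.reshape(B, N // self.R, C * self.R)    # shorten the sequence by R
        x_r = self.sr(x_r)                             # (B, N/R, C)
        k, v = self.kv(x_r).chunk(2, dim=-1)           # each (B, N/R, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, N/R): cost O(N^2 / R)
        return attn.softmax(dim=-1) @ v                # (B, N, C)

attn = EfficientSelfAttention(dim=32, reduction_ratio=64)   # stage-1 setting: R = 64
y = attn(torch.randn(1, 128 * 128, 32))
print(y.shape)  # torch.Size([1, 16384, 32])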
Mix-FFN:
Motivation: positional encoding (PE) is problematic when the test resolution differs from training, because the PE must be interpolated to the new size, which hurts accuracy.
Objective: Mix-FFN mixes a 3x3 convolution into the Feed-Forward Network, removing the need for positional encoding.
Formula: x_out = MLP(GELU(Conv_3x3(MLP(x_in)))) + x_in, where x_in is the output of the self-attention layer.
In practice, the 3x3 convolution is implemented as a depth-wise convolution to save parameters and FLOPs.
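A minimal PyTorch sketch of Mix-FFN (illustrative names; a channel width of 32 and an expansion ratio of 4 are assumed here purely for the example):

import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Mix-FFN: MLP -> 3x3 depth-wise conv -> GELU -> MLP, with a residual connection."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # The zero-padded depth-wise 3x3 conv leaks location information,
        # acting as an implicit positional cue, so no explicit PE is needed.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):                        # x: (B, N, C), N = H*W
        res = x
        x = self.fc1(x)                                # (B, N, hidden)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)      # back to a 2D map for the conv
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)               # (B, N, hidden)
        x = self.fc2(self.act(x))
        return x + res                                 # residual: + x_in

ffn = MixFFN(dim=32, hidden_dim=32 * 4)
y = ffn(torch.randn(1, 128 * 128, 32), H=128, W=128)
print(y.shape)  # torch.Size([1, 16384, 32])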
Decoder: Lightweight All-MLP Decoder
Every decoder component is just an MLP (a linear layer, equivalent to a 1x1 convolution).
Steps:
The features from the 4 stages are each passed through an MLP so they share the same channel dimension C.
Then, they are up-sampled to (H/4, W/4) and concatenated.
An MLP fuses the concatenated features and reduces the channel dimension from 4C back to C (e.g., 256).
Lastly, another MLP takes the fused C-dim feature map and outputs the segmentation map (H/4, W/4, N_cls), which is then up-sampled 4x to (H, W, N_cls).
Formula (a code sketch follows below):
F_hat_i = Linear(C_i, C)(F_i), for all i
F_hat_i = Upsample(H/4 x W/4)(F_hat_i), for all i
F = Linear(4C, C)(Concat(F_hat_i))
M = Linear(C, N_cls)(F), where M is the predicted mask of size (H/4, W/4, N_cls)
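A minimal PyTorch sketch of the All-MLP decoder following the four steps above (assumed MiT-B0-like input channels, a decoder width C = 256, and 19 classes as in Cityscapes, purely for illustration; details such as normalization and dropout in the official head are omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """Lightweight All-MLP decoder head (sketch)."""
    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        # Step 1: unify the channel dimension of every stage (1x1 conv == per-pixel MLP).
        self.linear_c = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1)
                                       for c in in_channels])
        # Step 3: fuse the concatenated features 4C -> C.
        self.fuse = nn.Conv2d(4 * embed_dim, embed_dim, kernel_size=1)
        # Step 4: predict per-pixel class logits.
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, feats):                      # feats: (B, Ci, H/2^(i+1), W/2^(i+1)) per stage
        size = feats[0].shape[-2:]                 # (H/4, W/4)
        outs = []
        for f, proj in zip(feats, self.linear_c):
            f = proj(f)                            # step 1
            # Step 2: up-sample everything to (H/4, W/4).
            outs.append(F.interpolate(f, size=size, mode='bilinear', align_corners=False))
        x = self.fuse(torch.cat(outs, dim=1))      # step 3
        x = self.classifier(x)                     # (B, N_cls, H/4, W/4)
        # Final 4x up-sampling to full resolution.
        return F.interpolate(x, scale_factor=4, mode='bilinear', align_corners=False)

decoder = AllMLPDecoder()
feats = [torch.randn(1, c, 128 // 2 ** i, 128 // 2 ** i)
         for i, c in enumerate((32, 64, 160, 256))]
print(decoder(feats).shape)  # torch.Size([1, 19, 512, 512])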
3) Experimental Results:
Experimental Results: SegFormer outperforms prior methods (e.g., SETR, DeepLabV3+) on ADE20K and Cityscapes while using far fewer parameters and FLOPs, and shows much stronger zero-shot robustness on the corrupted Cityscapes-C benchmark.
Ablations:
This strategy can only produce larger tokens; it cannot produce tokens smaller than the base tokens (as noted in the MST paper).
References: