[SimMIM] SimMIM: A Simple Framework for Masked Image Modeling
Tsinghua University, Microsoft Research Asia, and Xi’an Jiaotong University
{Self-Supervised Learning}
Paper: https://arxiv.org/abs/2111.09886
Code:
This is a study of Transformer pre-training in which masked areas of an image are predicted by direct regression of their raw pixel values.
Despite its very simple structure, it can outperform existing self-supervised learning methods such as DINO.
Like MAE, SimMIM masks patches of the input image.
Unlike MAE, however, SimMIM also feeds the masked patches (as learnable mask tokens) into the encoder, and the raw pixels are predicted directly from the encoder output.
Random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pretext task.
Predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs.
The prediction head can be as light as a linear layer, with no worse performance than heavier ones.
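These three findings translate into a very small model: replace masked patches with a mask token, run the full sequence through the encoder, and regress pixels with one linear head. Below is a minimal sketch of that pipeline in PyTorch; the class name, the `encoder` argument, and the dimensions are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class SimMIMSketch(nn.Module):
    """Illustrative SimMIM-style pipeline: masked patches are replaced by a
    learnable mask token, the full token sequence goes through the encoder,
    and a one-layer head regresses the raw RGB pixels of every patch."""

    def __init__(self, encoder, embed_dim=768, patch_size=32, in_chans=3):
        super().__init__()
        self.encoder = encoder  # any backbone mapping (B, N, D) tokens to (B, N, D)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # prediction head "as light as a linear layer"
        self.head = nn.Linear(embed_dim, patch_size * patch_size * in_chans)

    def forward(self, patch_tokens, mask):
        # patch_tokens: (B, N, D) embedded image patches; mask: (B, N), 1 = masked
        B, N, _ = patch_tokens.shape
        mask_tokens = self.mask_token.expand(B, N, -1)
        w = mask.unsqueeze(-1).type_as(mask_tokens)
        x = patch_tokens * (1.0 - w) + mask_tokens * w   # masked patches become mask tokens
        x = self.encoder(x)                              # encoder sees the full sequence
        return self.head(x)                              # (B, N, patch_size*patch_size*in_chans)
```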
Either a Swin Transformer or a ViT is used as the encoder architecture.
A patch-aligned random masking strategy is used: since image patches are the basic processing units of vision Transformers, it is convenient to operate the masking at the patch level, so that each patch is either fully visible or fully masked.
For Swin Transformer, the equivalent patch sizes of its different resolution stages (4×4 ∼ 32×32) are considered, and 32×32, the patch size of the last stage, is adopted by default.
For ViT, 32×32 is adopted as the default masked patch size.
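As a concrete illustration of patch-aligned masking with a 32×32 masked patch size, the snippet below builds a random binary mask over the patch grid of a 224×224 image. The helper name and the 0.6 mask ratio are assumptions for illustration, not the official code.

```python
import torch

def patch_aligned_random_mask(img_size=224, masked_patch_size=32, mask_ratio=0.6):
    # 224 / 32 = 7, so the mask lives on a 7x7 grid of 32x32 regions
    grid = img_size // masked_patch_size
    num_patches = grid * grid
    num_masked = int(mask_ratio * num_patches)
    mask = torch.zeros(num_patches)
    mask[torch.randperm(num_patches)[:num_masked]] = 1.0  # 1 = fully masked patch
    return mask.reshape(grid, grid)

mask = patch_aligned_random_mask()  # 7x7 binary grid; each entry covers a whole 32x32 patch
```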
Head
One-layer Prediction Head.
In this paper, the prediction head is made extremely lightweight, as light as a linear layer.
Heavier heads are also tried such as a 2-layer MLP, an inverse Swin-T, and an inverse Swin-B.
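To make the "as light as a linear layer" point concrete, here is a sketch of the one-layer head next to the heavier 2-layer MLP variant; the embedding dimension and patch size are assumed values for illustration.

```python
import torch.nn as nn

embed_dim, patch_size, in_chans = 768, 32, 3
out_dim = patch_size * patch_size * in_chans  # raw pixels of one masked patch

# default: one-layer (linear) prediction head
linear_head = nn.Linear(embed_dim, out_dim)

# heavier alternative that was also tried: a 2-layer MLP
mlp_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.GELU(),
    nn.Linear(embed_dim, out_dim),
)
```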
Targets
In iGPT, pixel values are clustered, and the task is to predict the cluster index of each pixel.
In BEiT, visual tokens are predicted.
In this paper, raw pixel value regression is performed.
An ℓ1-loss is employed on the masked pixels:

L = (1/Ω(x_M)) · ‖y_M − x_M‖₁,

where x and y denote the input and predicted RGB values, M denotes the set of masked pixels, and Ω(·) is the number of elements.
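A minimal sketch of this masked ℓ1 loss, assuming dense (B, C, H, W) predictions and a per-pixel binary mask; the function name and tensor layout are illustrative.

```python
import torch

def masked_l1_loss(pred, target, mask):
    # pred, target: (B, 3, H, W) predicted / ground-truth RGB; mask: (B, 1, H, W), 1 = masked pixel
    num_masked = (mask.sum() * pred.size(1)).clamp(min=1)     # Omega(x_M): count of masked RGB values
    return ((pred - target).abs() * mask).sum() / num_masked  # l1 computed only on masked pixels
```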
Transformer-based decoders as in MAE (e.g., an inverse Swin-T) were also tried, but the simplest linear head outperformed them in both accuracy and computational cost.
SimMIM does not give a more detailed breakdown, but in short it can be viewed as an MAE-like framework in which the decoder is reduced to just one or two MLP layers, and it still performs well.