[SimMIM] SimMIM: A Simple Framework for Masked Image Modeling
Tsinghua University, Microsoft Research Asia, and Xi’an Jiaotong University
{Self-Supervised Learning}
Paper: https://arxiv.org/abs/2111.09886
Code:
This is a study of Transformer pre-training in which masked areas of an image are predicted by direct regression of their raw pixel values.
Despite its very simple structure, it can outperform existing self-supervised learning methods such as DINO.
Like MAE, SimMIM masks patches of the input image.
Unlike MAE, however, SimMIM also feeds the masked patches (as learnable mask tokens) into the encoder, and the raw pixels are predicted directly from the encoder output.
Random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pretext task.
Predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs.
The prediction head can be as light as a linear layer, with no worse performance than heavier ones.
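These three findings translate into a very small model: replace masked patches with a mask token, run the full sequence through the encoder, and regress pixels with one linear head. Below is a minimal sketch of that pipeline in PyTorch; the class name, the `encoder` argument, and the dimensions are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class SimMIMSketch(nn.Module):
    """Illustrative SimMIM-style pipeline: masked patches are replaced by a
    learnable mask token, the full token sequence goes through the encoder,
    and a one-layer head regresses the raw RGB pixels of every patch."""

    def __init__(self, encoder, embed_dim=768, patch_size=32, in_chans=3):
        super().__init__()
        self.encoder = encoder  # any backbone mapping (B, N, D) tokens to (B, N, D)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # prediction head "as light as a linear layer"
        self.head = nn.Linear(embed_dim, patch_size * patch_size * in_chans)

    def forward(self, patch_tokens, mask):
        # patch_tokens: (B, N, D) embedded image patches; mask: (B, N), 1 = masked
        B, N, _ = patch_tokens.shape
        mask_tokens = self.mask_token.expand(B, N, -1)
        w = mask.unsqueeze(-1).type_as(mask_tokens)
        x = patch_tokens * (1.0 - w) + mask_tokens * w   # masked patches become mask tokens
        x = self.encoder(x)                              # encoder sees the full sequence
        return self.head(x)                              # (B, N, patch_size*patch_size*in_chans)
```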
Either a Swin Transformer or a ViT is used as the encoder architecture.
A patch-aligned random masking strategy is used: since image patches are the basic processing units of vision Transformers, it is convenient to operate the masking at the patch level, so that each patch is either fully visible or fully masked.
For Swin Transformer, the equivalent patch sizes of its different resolution stages (4×4 ∼ 32×32) are considered, and 32×32, the patch size of the last stage, is adopted by default.
For ViT, 32×32 is adopted as the default masked patch size.
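As a concrete illustration of patch-aligned masking with a 32×32 masked patch size, the snippet below builds a random binary mask over the patch grid of a 224×224 image. The helper name and the 0.6 mask ratio are assumptions for illustration, not the official code.

```python
import torch

def patch_aligned_random_mask(img_size=224, masked_patch_size=32, mask_ratio=0.6):
    # 224 / 32 = 7, so the mask lives on a 7x7 grid of 32x32 regions
    grid = img_size // masked_patch_size
    num_patches = grid * grid
    num_masked = int(mask_ratio * num_patches)
    mask = torch.zeros(num_patches)
    mask[torch.randperm(num_patches)[:num_masked]] = 1.0  # 1 = fully masked patch
    return mask.reshape(grid, grid)

mask = patch_aligned_random_mask()  # 7x7 binary grid; each entry covers a whole 32x32 patch
```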
Head
One-layer Prediction Head.
In this paper, the prediction head is made extremely lightweight, as light as a linear layer.
Heavier heads are also tried such as a 2-layer MLP, an inverse Swin-T, and an inverse Swin-B.
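To make the "as light as a linear layer" point concrete, here is a sketch of the one-layer head next to the heavier 2-layer MLP variant; the embedding dimension and patch size are assumed values for illustration.

```python
import torch.nn as nn

embed_dim, patch_size, in_chans = 768, 32, 3
out_dim = patch_size * patch_size * in_chans  # raw pixels of one masked patch

# default: one-layer (linear) prediction head
linear_head = nn.Linear(embed_dim, out_dim)

# heavier alternative that was also tried: a 2-layer MLP
mlp_head = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.GELU(),
    nn.Linear(embed_dim, out_dim),
)
```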
Targets
In iGPT, pixel values are clustered, and the task is to predict the cluster index of each pixel.
In BEiT, visual tokens are predicted.
In this paper, raw pixel value regression is performed.
An ℓ1-loss is employed on the masked pixels:

L = (1/Ω(x_M)) · ‖y_M − x_M‖₁,

where x and y denote the input and predicted RGB values, M denotes the set of masked pixels, and Ω(·) is the number of elements.
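A minimal sketch of this masked ℓ1 loss, assuming dense (B, C, H, W) predictions and a per-pixel binary mask; the function name and tensor layout are illustrative.

```python
import torch

def masked_l1_loss(pred, target, mask):
    # pred, target: (B, 3, H, W) predicted / ground-truth RGB; mask: (B, 1, H, W), 1 = masked pixel
    num_masked = (mask.sum() * pred.size(1)).clamp(min=1)     # Omega(x_M): count of masked RGB values
    return ((pred - target).abs() * mask).sum() / num_masked  # l1 computed only on masked pixels
```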
Transformer-based decoders as in MAE (e.g., an inverse Swin-T) were also tried, but the simplest linear head outperformed them in both accuracy and computational cost.
SimMIM does not give a more detailed breakdown, but in short it can be viewed as an MAE-like framework in which the decoder is reduced to just one or two MLP layers, and it still performs well.