[MAE] Masked Autoencoders Are Scalable Vision Learners
Code: https://github.com/facebookresearch/mae
Video: https://www.youtube.com/watch?v=Dp6iICL2dVI
Large amounts of training data are readily available in NLP.
Autoregressive language modeling in GPT and masked autoencoding in BERT are conceptually simple: they remove a portion of the data and learn to predict the removed content.
These methods have made it viable to train NLP models containing over a hundred billion parameters.
MAE is a scalable self-supervised learner for computer vision: it divides the image into patches and, as its pre-training task, predicts the masked patches of the image.
It uses an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens) and a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Masking a high proportion of the input image, e.g., 75%, generates a nontrivial and meaningful self-supervisory task.
This enables training large models efficiently and effectively: it accelerates training (by 3× or more) and improves accuracy.
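To see why the asymmetric design is cheap, here is a back-of-the-envelope count of how many tokens each part processes, assuming a 224×224 input and 16×16 patches (the standard ViT setting); the snippet is purely illustrative:

```python
# Token counts for MAE with a 224x224 image, 16x16 patches, and a 75% masking ratio.
img_size, patch_size, mask_ratio = 224, 16, 0.75

num_patches = (img_size // patch_size) ** 2        # 14 * 14 = 196 patch tokens
num_visible = int(num_patches * (1 - mask_ratio))  # 49 tokens go through the large encoder
num_masked = num_patches - num_visible             # 147 tokens are handled only by the small decoder

print(num_patches, num_visible, num_masked)        # 196 49 147
```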
These are successful models that have been used for pre-training in NLP. They hold out part of the input sequence, train the model to predict the missing content, and scale very well. Representative examples are BERT and GPT.
This classical approach has two main parts (a minimal sketch follows this list):
Encoder (which maps an input to a latent representation).
Decoder (which rebuilds the input).
Some examples: PCA, K-means, DAE (Denoising AutoEncoder), etc.
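For reference, a minimal denoising-autoencoder (DAE) sketch in PyTorch; the layer widths and noise level are illustrative choices, not taken from any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    """Classical DAE: corrupt the input, encode it to a latent code, decode it back."""
    def __init__(self, in_dim=784, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))   # input -> latent representation
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))       # latent -> reconstructed input

    def forward(self, x):
        x_noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input
        return self.decoder(self.encoder(x_noisy))

x = torch.rand(8, 784)                           # e.g., flattened 28x28 images
loss = F.mse_loss(DenoisingAutoencoder()(x), x)  # train to reconstruct the clean input
```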
These methods learn image representations by modeling the similarity and dissimilarity between two or more views of an image, and they depend heavily on data augmentation.
Figure: BEiT.
The MAE proposed in this work is not complex: it is a simple autoencoder that sees only a partial observation of the input image and reconstructs the complete image. It is much like previous (classical) autoencoders, except for its asymmetric architecture, which means the full encoder does not have to process every patch of the image.
The input image I ∈ R^(3×H×W) is split into patches (e.g., 16×16) as in ViT, and a high proportion of the patches (75%) is masked.
The unmasked patches are passed through the encoder Φ_enc; the result is then combined with mask tokens for the masked patches and passed through the decoder Φ_dec.
The goal is to restore the original image as closely as possible. Because this reconstruction is carried out per patch, each patch (token) can be seen as learning semantic information.
Hence, this method is termed the “token-level approach”.
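A sketch of the patchify step that turns I ∈ R^(3×H×W) into a sequence of flattened 16×16 patches; the reshape-based helper below is an illustration, not the official implementation:

```python
import torch

def patchify(imgs, p=16):
    """(B, 3, H, W) -> (B, N, p*p*3) with N = (H/p)*(W/p), as in ViT-style tokenization."""
    B, C, H, W = imgs.shape
    h, w = H // p, W // p
    x = imgs.reshape(B, C, h, p, w, p)       # split H and W into a grid of patches
    x = x.permute(0, 2, 4, 3, 5, 1)          # (B, h, w, p, p, C)
    return x.reshape(B, h * w, p * p * C)    # flatten each patch into one token

imgs = torch.rand(2, 3, 224, 224)
print(patchify(imgs).shape)                  # torch.Size([2, 196, 768])
```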
Produce a token for every patch (by linear projection with an added positional embedding)
Shuffle the list of tokens randomly, then delete the last portion of the list (according to the masking ratio). This yields a small subset of tokens (sampling patches without replacement).
After encoding, a list of mask tokens is appended to the list of encoded patches, and this full list is unshuffled (inverting the random shuffle) so that every token is aligned with its target; see the sketch after this list.
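The shuffle/drop/unshuffle bookkeeping can be implemented with argsort over per-token random noise, in the spirit of (but not copied from) the `random_masking` routine in the official repo:

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D). Keep a random subset and remember how to undo the shuffle."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                         # one random score per token
    ids_shuffle = torch.argsort(noise, dim=1)        # a random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # its inverse (the "unshuffle")

    ids_keep = ids_shuffle[:, :len_keep]             # "delete the last portion of the list"
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    mask = torch.ones(B, N)                          # 1 = masked, 0 = visible
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # put the mask back in the original order
    return visible, mask, ids_restore
```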
The image is split into regular non-overlapping patches; a subset of patches is sampled and the rest are masked (i.e., removed).
The masking strategy used here is straightforward random sampling: patches are sampled without replacement following a uniform distribution, which avoids a potential center bias.
A high masking ratio (75% of the patches removed) largely eliminates redundancy, creating a task that cannot be solved simply by extrapolating from visible neighboring patches.
Encoder
ViT (Vision Transformer).
The encoder is applied only to the visible, unmasked patches, so a very large model can be used while saving computation and memory.
The encoder embeds patches with a linear projection plus an added positional embedding, and then processes the resulting set with a series of Transformer blocks.
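A sketch of the encoder side using stock PyTorch Transformer blocks; the widths, depths, and the `ids_keep` interface are illustrative assumptions, not the official ViT code:

```python
import torch
import torch.nn as nn

class MAEEncoder(nn.Module):
    """Linear patch embedding + positional embedding, then Transformer blocks
    run only on the visible tokens (roughly 25% of the sequence)."""
    def __init__(self, patch_dim=768, dim=768, depth=12, heads=12, num_patches=196):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                  # linear projection of each patch
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches, ids_keep):
        # patches: (B, N, patch_dim); ids_keep: (B, N_visible) indices of visible patches
        x = self.proj(patches) + self.pos_embed                # tokenize all patches
        x = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, x.shape[-1]))
        return self.blocks(x)                                  # encode visible tokens only
```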
Decoder
The decoder is also a stack of Transformer blocks, but a much lighter one than the encoder: per token, it needs less than 10% of the encoder's computation.
The decoder is used only during pre-training to reconstruct the masked patches; for downstream tasks, only the encoder is kept.
The output of the decoder is a vector of pixel values representing a patch.
The final layer of the decoder is a linear projection.
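A sketch of this lightweight decoder: narrow the encoded tokens to the decoder width, fill the masked positions with a shared learned mask token, unshuffle, run a few Transformer blocks, and project each token back to pixel values. Widths and depths are illustrative, and the class-token handling of the official code is omitted:

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    def __init__(self, enc_dim=768, dim=512, depth=8, heads=16,
                 num_patches=196, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)            # narrower width: the decoder is lightweight
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_dim)           # final linear projection to pixel values

    def forward(self, latent, ids_restore):
        x = self.embed(latent)                          # (B, N_visible, dim)
        B, n_visible, D = x.shape
        n_masked = ids_restore.shape[1] - n_visible
        x = torch.cat([x, self.mask_token.repeat(B, n_masked, 1)], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).repeat(1, 1, D))  # unshuffle
        x = x + self.pos_embed
        return self.head(self.blocks(x))                # per-token pixel predictions
```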
The loss is the Mean Squared Error between the reconstructed and original patches in pixel space, computed only on the masked tokens:
L = (1/|Ω|) · Σ_{p∈Ω} (Î_p − I_p)²
where p is the token index, Ω is the set of masked tokens, I is the input image, and Î is the reconstructed image.
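In code, this is just an MSE restricted to the masked tokens; a minimal sketch, assuming `pred` and `target` are patchified to shape (B, N, p·p·3) and `mask` marks masked tokens with 1:

```python
def mae_loss(pred, target, mask):
    """MSE in pixel space, averaged only over the masked patches."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # (B, N): mean over the pixels of each patch
    return (per_patch * mask).sum() / mask.sum()     # average over masked patches only
```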
First, let’s look at the image reconstruction (pre-training) task. The examples are reconstructions of ImageNet validation images; the images are reconstructed successfully even though 80% of the patches are masked.
Next, consider the effect of the masking ratio. The figure below plots masking ratio against accuracy on the downstream image classification task: accuracy improves as the masking ratio increases, peaking around 75%.
Next, the downstream-task results. On image classification, MAE gives excellent results compared to other self-supervised learning methods built on ViT.
Finally, on object detection and semantic segmentation, MAE also outperforms existing self-supervised methods as well as supervised pre-training.