Predicting masked inputs from the surrounding data is the earliest category of self-supervised methods. The idea can actually be traced back to the linguist John Rupert Firth's (1957) dictum: "You shall know a word by the company it keeps."
This line of algorithms started with word2vec (ref) in the text field in 2013. The continuous bag of words (CBOW) objective of word2vec predicts a central word from its neighbors, which is very similar to ELMo (ref) and the masked language modeling (MLM) objective of BERT (ref). These models are all categorized as non-autoregressive generative approaches. The main differences are that later models use more advanced architectures, such as a bidirectional LSTM (ELMo) and a Transformer (BERT), and that the recent models produce contextual embeddings.
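To make the CBOW idea concrete, here is a minimal PyTorch sketch that averages the embeddings of the surrounding words and classifies the center word. The vocabulary size, dimensions, and class names are illustrative choices, not word2vec's original settings (the original also used tricks such as negative sampling to scale up):

```python
# A minimal CBOW sketch: predict the center word from the average of
# its neighbors' embeddings. Sizes below are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, embed_dim, window = 10_000, 128, 2

class CBOW(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, 2 * window) indices of surrounding words
        ctx = self.embed(context_ids).mean(dim=1)  # average the neighbors
        return self.out(ctx)                       # logits for the center word

model = CBOW()
context = torch.randint(0, vocab_size, (32, 2 * window))  # fake batch
center = torch.randint(0, vocab_size, (32,))
loss = nn.functional.cross_entropy(model(context), center)
loss.backward()
```

BERT's MLM follows the same "predict the hidden word from its company" recipe, except that the context is encoded by a deep Transformer and only a randomly masked subset of positions is predicted.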
In the speech field, Mockingjay (ref) masked all feature dimensions of consecutive frames, while TERA (ref) masked a specific subset of feature dimensions. In the image field, OpenAI applied the BERT-style regime to images (ref). In the graph field, GPT-GNN masked attributes and edges (ref). These methods all masked part of the input data and trained the model to predict it back, as the sketch below illustrates.
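As a rough illustration of the two speech-masking strategies just described, the sketch below zeroes out either whole consecutive frames (Mockingjay-style, along the time axis) or a subset of feature dimensions (TERA-style, along the channel axis) in a spectrogram tensor. The span length, channel range, and function names are hypothetical, not the papers' exact settings:

```python
# Two masking strategies on a (time, feature) spectrogram tensor.
# Span lengths and masked channels below are illustrative assumptions.
import torch

def mask_time(x, start, span):
    """Mockingjay-style: hide ALL feature dims of consecutive frames."""
    x = x.clone()
    x[start:start + span, :] = 0.0
    return x

def mask_channels(x, channels):
    """TERA-style: hide a chosen SUBSET of feature dims across all frames."""
    x = x.clone()
    x[:, channels] = 0.0
    return x

features = torch.randn(100, 80)                   # 100 frames, 80 mel bins
time_masked = mask_time(features, 40, 7)          # frames 40..46 fully hidden
chan_masked = mask_channels(features, torch.arange(10, 20))  # bins 10..19 hidden
# The model is then trained to reconstruct the original values
# at the masked positions.
```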
On the other hand, another family of generative approaches predicts the next token/pixel/acoustic feature. In the text field, the GPT series of models (ref & ref) pioneered this category. APC (ref) and ImageGPT (ref) applied the same idea in the speech and image fields, respectively. Interestingly, because adjacent acoustic features are so easy to predict, the model is usually asked to predict a frame further ahead in the sequence (at least 3 steps away).
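Here is a minimal sketch of this future-frame prediction objective in the spirit of APC: the target is the frame 3 steps ahead rather than the immediate next frame, so that trivially smooth neighboring frames do not make the task degenerate. The model size and the L1 regression loss are simplified assumptions about the setup, not the paper's exact configuration:

```python
# Autoregressive future-frame prediction: from frames up to time t,
# predict the frame at t + shift (shift >= 3 for speech, per the text).
import torch
import torch.nn as nn

feat_dim, hidden, shift = 80, 256, 3               # illustrative sizes

rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
head = nn.Linear(hidden, feat_dim)

frames = torch.randn(8, 100, feat_dim)             # (batch, time, mel bins)
inputs = frames[:, :-shift, :]                     # frames up to t
targets = frames[:, shift:, :]                     # frames at t + shift

hidden_states, _ = rnn(inputs)
pred = head(hidden_states)
loss = nn.functional.l1_loss(pred, targets)        # simple regression loss
loss.backward()
```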
The great success of self-supervised learning (especially BERT/GPT) motivated researchers to apply similar generative approaches to other fields such as image and speech. However, for image and speech data, it is harder to generate the masked inputs: choosing from a limited vocabulary of text tokens is easier than choosing from an effectively unlimited space of image pixels or acoustic features. The performance improvements were not as large as in the text field. Therefore, researchers also developed many non-generative approaches, covered in the following sections.