Masked Image Models

The authors propose a simple yet effective method to pretrain large vision models (here, ViT-Huge). Inspired by the pretraining objective of BERT (Devlin et al.), they mask random patches of an image and train an autoencoder to reconstruct the masked patches. In the spirit of "masked language modeling", this pretraining task could be called "masked image modeling".
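
The masking step can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `patchify`, `random_mask`, the 32×32 image, the 8-pixel patch size, and the 75% mask ratio are all assumed for the example.

```python
import numpy as np

def patchify(img, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = img.shape
    p = patch_size
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def random_mask(patches, mask_ratio, rng):
    """Randomly hide a fraction of patches; return the visible ones plus a mask."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)   # True = masked (to be reconstructed)
    mask[keep_idx] = False          # False = visible (fed to the encoder)
    return patches[keep_idx], keep_idx, mask

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
patches = patchify(img, 8)                       # 16 patches of 8*8*3 = 192 values
visible, keep_idx, mask = random_mask(patches, 0.75, rng)
print(visible.shape)                             # only 25% of patches remain visible
```

During pretraining, only the visible patches would be encoded; the decoder then predicts the pixel values of the masked patches, and the reconstruction loss is computed on the masked positions.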