Generative Pretraining from Pixels (Image GPT)
Uses a transformer for pixel-level image completion, just like other GPT models do for text completion
In this post, I summarize the ideas from a new paper from OpenAI.
Image GPT is a GPT-2-style transformer model trained on pixel sequences to generate image completions and samples. Like general pre-trained language models, it is designed to learn high-quality unsupervised image representations. It predicts the next pixel auto-regressively without any knowledge of the 2D structure of the input image.
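To make the sequence formulation concrete, here is a minimal sketch (not the paper's code) of flattening an image into a 1D pixel sequence in raster order and shifting it by one position to form next-pixel prediction targets; the array sizes and value range below are purely illustrative.

```python
import numpy as np

def make_next_pixel_examples(image: np.ndarray):
    """Flatten an image to a 1D pixel sequence in raster order and
    build (input, target) pairs for autoregressive next-pixel prediction.

    `image` is assumed to be a 2D array of integer pixel values
    (e.g. a low-resolution grayscale image with values in [0, 255]).
    """
    seq = image.reshape(-1)   # raster-order flattening: the model never sees the 2D layout
    inputs = seq[:-1]         # pixels 0 .. n-2 form the context
    targets = seq[1:]         # pixel at position t+1 is predicted from positions <= t
    return inputs, targets

# Toy usage with a random 4x4 "image" (illustrative only)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 4))
x, y = make_next_pixel_examples(img)
print(x.shape, y.shape)       # (15,) (15,)
```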
Features from the pre-trained Image GPT achieve state-of-the-art performance on a number of classification benchmarks and near state-of-the-art unsupervised accuracy on ImageNet.
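Representation quality like this is commonly measured with a linear probe: freeze the pre-trained model, extract a feature vector for each image, and train only a linear classifier on top. The sketch below uses placeholder random features in place of real extracted activations; the feature dimension and class count are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder features: stand-ins for activations extracted from a frozen,
# pre-trained model. In practice these would come from an intermediate layer.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))    # 1000 images, 512-d features (illustrative sizes)
labels = rng.integers(0, 10, size=1000)    # 10 classes, e.g. a CIFAR-10-like setup

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)

# Linear probe: only this linear classifier is trained; the feature extractor stays frozen.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print("linear-probe accuracy:", probe.score(X_te, y_te))
```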
Uses the same transformer architecture that GPT-2 uses for natural language text (see the sketch after these takeaways)
Unsupervised learning without human labeling
Needs more compute to generate competitive representations
Learned features achieve state-of-the-art performance on classification benchmarks with low-resolution datasets
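For the architecture point above, here is a minimal sketch of a GPT-2-style decoder-only transformer over flattened pixel tokens. It assumes a 256-value pixel vocabulary and tiny layer sizes chosen for illustration; the released models are far larger, so treat this as a sketch of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyPixelGPT(nn.Module):
    """Minimal GPT-2-style decoder over flattened pixel tokens (illustrative only)."""

    def __init__(self, vocab_size=256, seq_len=1024, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # one embedding per pixel value
        self.pos_emb = nn.Embedding(seq_len, d_model)      # learned 1D positions, no 2D prior
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,             # pre-norm blocks, as in GPT-2
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)          # logits over the next pixel value

    def forward(self, tokens):                              # tokens: (batch, seq) of pixel ids
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: position t may only attend to positions <= t (autoregressive)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.head(h)

# Toy step: predict pixel t+1 from pixels <= t on two flattened 8x8 "images"
model = TinyPixelGPT()
pixels = torch.randint(0, 256, (2, 64))
logits = model(pixels[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), pixels[:, 1:].reshape(-1))
print(loss.item())
```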