Generative Pretraining from Pixels (Image GPT)
Uses a transformer for pixel-level image completion, just like other GPT models do for text completion
In this post, I summarize the ideas from a new paper from OpenAI.
Image GPT is a GPT-2-style transformer model trained on pixel sequences to generate image completions and samples. Like general pre-trained language models, it is designed to learn high-quality unsupervised image representations. It predicts the next pixel auto-regressively without any knowledge of the 2D structure of the input image.
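To make the sequence formulation concrete, here is a minimal sketch (not the paper's code) of flattening an image into a 1D pixel sequence in raster order and shifting it by one position to form next-pixel prediction targets; the array sizes and value range below are purely illustrative.

```python
import numpy as np

def make_next_pixel_examples(image: np.ndarray):
    """Flatten an image to a 1D pixel sequence in raster order and
    build (input, target) pairs for autoregressive next-pixel prediction.

    `image` is assumed to be a 2D array of integer pixel values
    (e.g. a low-resolution grayscale image with values in [0, 255]).
    """
    seq = image.reshape(-1)   # raster-order flattening: the model never sees the 2D layout
    inputs = seq[:-1]         # pixels 0 .. n-2 form the context
    targets = seq[1:]         # pixel at position t+1 is predicted from positions <= t
    return inputs, targets

# Toy usage with a random 4x4 "image" (illustrative only)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 4))
x, y = make_next_pixel_examples(img)
print(x.shape, y.shape)       # (15,) (15,)
```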
Features from the pre-trained Image GPT achieve state-of-the-art performance on a number of classification benchmarks and near state-of-the-art unsupervised accuracy on ImageNet.
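Representation quality like this is commonly measured with a linear probe: freeze the pre-trained model, extract a feature vector for each image, and train only a linear classifier on top. The sketch below uses placeholder random features in place of real extracted activations; the feature dimension and class count are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder features: stand-ins for activations extracted from a frozen,
# pre-trained model. In practice these would come from an intermediate layer.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))    # 1000 images, 512-d features (illustrative sizes)
labels = rng.integers(0, 10, size=1000)    # 10 classes, e.g. a CIFAR-10-like setup

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)

# Linear probe: only this linear classifier is trained; the feature extractor stays frozen.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print("linear-probe accuracy:", probe.score(X_te, y_te))
```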
Uses the same transformer architecture that GPT-2 uses for natural language text (see the sketch after these takeaways)
Unsupervised learning without human labeling
Needs more compute to generate competitive representations
Learned features achieve state-of-the-art performance on classification benchmarks with low-resolution datasets
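For the architecture point above, here is a minimal sketch of a GPT-2-style decoder-only transformer over flattened pixel tokens. It assumes a 256-value pixel vocabulary and tiny layer sizes chosen for illustration; the released models are far larger, so treat this as a sketch of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyPixelGPT(nn.Module):
    """Minimal GPT-2-style decoder over flattened pixel tokens (illustrative only)."""

    def __init__(self, vocab_size=256, seq_len=1024, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # one embedding per pixel value
        self.pos_emb = nn.Embedding(seq_len, d_model)      # learned 1D positions, no 2D prior
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,             # pre-norm blocks, as in GPT-2
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)          # logits over the next pixel value

    def forward(self, tokens):                              # tokens: (batch, seq) of pixel ids
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: position t may only attend to positions <= t (autoregressive)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.head(h)

# Toy step: predict pixel t+1 from pixels <= t on two flattened 8x8 "images"
model = TinyPixelGPT()
pixels = torch.randint(0, 256, (2, 64))
logits = model(pixels[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), pixels[:, 1:].reshape(-1))
print(loss.item())
```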