[CPC] Representation Learning with Contrastive Predictive Coding
{Predicting the Future}
Paper: https://arxiv.org/pdf/1807.03748.pdf
Code:
Illustration of the contrastive task in CPC with an audio input
From bottom to top: first comes the audio input, whose samples are denoted x. g_enc is a non-linear encoder that takes the sequence of observations x_t as input and produces a sequence of latent representations z_t, possibly at a lower temporal resolution. Next, g_ar, an autoregressive model, summarizes all z_(<=t) in the latent space and produces a context latent representation c_t.
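A minimal PyTorch sketch of these two modules for the audio case (layer sizes follow the paper's audio setup: five strided convolutions with an overall downsampling factor of 160, and a GRU with a 256-dimensional context; class and variable names here are illustrative):

```python
import torch
import torch.nn as nn

class CPCAudio(nn.Module):
    """g_enc maps raw audio x to latents z; g_ar summarizes z_(<=t) into c_t."""
    def __init__(self, z_dim=512, c_dim=256):
        super().__init__()
        kernels, strides = [10, 8, 4, 4, 4], [5, 4, 2, 2, 2]  # 160x downsampling
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, z_dim, k, stride=s, padding=k // 2), nn.ReLU()]
            in_ch = z_dim
        self.g_enc = nn.Sequential(*layers)                  # x_t -> z_t
        self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)   # z_(<=t) -> c_t

    def forward(self, x):                  # x: (batch, 1, n_samples)
        z = self.g_enc(x).transpose(1, 2)  # (batch, T, z_dim)
        c, _ = self.g_ar(z)                # (batch, T, c_dim)
        return z, c
```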
While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence.
The paper proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data: CPC (Contrastive Predictive Coding).
Key insight:
Powerful autoregressive models: learn such representations by predicting the future in latent space.
Probabilistic contrastive loss: induces the latent space to capture information that is maximally useful to predict future samples.
Negative sampling: makes the model tractable.
Illustration of training the CPC model.
Let {x_1, x_2, …, x_N} be a sequence of data points and x_t an anchor data point. Then x_(t+k) is a positive sample for this anchor, and a data point x_t* randomly sampled from the sequence is a negative sample.
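As a toy illustration of this sampling scheme (the helper below is hypothetical, not from the paper):

```python
import random

def sample_contrastive_indices(T, k, n_neg):
    """Pick an anchor t, its positive t + k, and random negative positions."""
    t = random.randrange(T - k)        # anchor x_t
    pos = t + k                        # positive x_(t+k)
    # negatives x_t*: random positions (a careful version would exclude pos)
    neg = [random.randrange(T) for _ in range(n_neg)]
    return t, pos, neg
```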
First divide the whole image into a coarse grid.
Given the upper few rows of the image, the task is to predict the lower rows of the same image.
The model has to learn the structure of the objects in the image (for example, having seen the face of a dog, the model should predict that it has four legs) => this gives us a useful representation for downstream tasks.
Fig. Divide the Image into a Coarse Grid
CPC can generate many sets of positive and negative samples. In practice, this process is applied to a batch of examples, where the rest of the examples in the batch are used as the negative samples (see the sketch after the Steps list below).
Fig. Generating positive, anchor, and negative pairs from a batch of images. (Batch size = 3).
Steps:
First, the embedding vectors are extracted from the high-dimensional input by a non-linear encoder.
Powerful autoregressive models are used to accumulate the information (context latent representation) from the latent vectors through time.
These summary features are then used to predict the future latent vectors directly, rather than reconstructing future observations in the high-dimensional input space.
The loss function is built on Noise-Contrastive Estimation (NCE), which allows the model to be trained end-to-end.
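Putting the four steps together for a single prediction offset k (a sketch; W_k is a linear head playing the role of the paper's log-bilinear prediction matrix, and the other examples in the batch act as negatives):

```python
import torch
import torch.nn.functional as F

def cpc_step(g_enc, g_ar, W_k, x, t, k):
    """One CPC training step for offset k. x: (B, 1, n_samples)."""
    z = g_enc(x).transpose(1, 2)       # step 1: encode, (B, T, z_dim)
    c, _ = g_ar(z)                     # step 2: summarize, (B, T, c_dim)
    pred = W_k(c[:, t])                # step 3: predict z_(t+k), (B, z_dim)
    scores = pred @ z[:, t + k].t()    # step 4: InfoNCE logits, (B, B)
    labels = torch.arange(x.size(0))   # positives sit on the diagonal
    return F.cross_entropy(scores, labels)
```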
Details:
Given a 256x256 image, divide it into a 7x7 grid of 64px cells, with a 32px overlap between neighboring cells.
Use an encoder model (g_enc), such as a ResNet-50, to encode each grid cell into a 1024-dimensional vector. The whole image is thus transformed into a 7x7x1024 tensor.
Given the top 3 rows of the transformed 7x7x1024 grid, generate the bottom 3 rows (a 3x7x1024 tensor).
An autoregressive generative model g_ar (a PixelCNN, for instance) is used to predict the bottom 3 rows.
The PixelCNN builds a context vector c_t from the given top 3 rows and sequentially predicts the rows below (z_(t+2), z_(t+3), z_(t+4) in the figure below).
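The grid arithmetic can be checked with a small snippet: a 256x256 image cut into 64px cells with a 32px stride (i.e. 32px overlap) gives (256 - 64) / 32 + 1 = 7 cells per side (a sketch; `unfold` stands in for whatever patch-extraction pipeline is actually used):

```python
import torch

img = torch.randn(1, 3, 256, 256)                  # a dummy 256x256 RGB image
patches = img.unfold(2, 64, 32).unfold(3, 64, 32)  # (1, 3, 7, 7, 64, 64)
patches = patches.permute(0, 2, 3, 1, 4, 5)        # (1, 7, 7, 3, 64, 64)
print(patches.shape[1:3])                          # torch.Size([7, 7])
# Each 64x64 patch then goes through g_enc (e.g. a ResNet-50) to produce a
# 1024-d vector, turning the image into a 7x7x1024 tensor.
```

And a simplified sketch of the prediction step, where a GRU running down each column stands in for the PixelCNN and per-offset linear heads play the role of the paper's W_k matrices (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class RowPredictor(nn.Module):
    """Summarize the top rows of a (B, 7, 7, 1024) grid of patch latents and
    predict the latents of the rows below, one linear head per offset."""
    def __init__(self, dim=1024, n_future=3):
        super().__init__()
        self.g_ar = nn.GRU(dim, dim, batch_first=True)  # stand-in for PixelCNN
        self.W = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_future))

    def forward(self, grid, n_ctx=3):                   # grid: (B, rows, cols, dim)
        B, R, C, D = grid.shape
        cols = grid.permute(0, 2, 1, 3).reshape(B * C, R, D)   # one sequence per column
        ctx, _ = self.g_ar(cols[:, :n_ctx])             # summarize the top n_ctx rows
        c_t = ctx[:, -1]                                # context vector c_t
        preds = torch.stack([w(c_t) for w in self.W], dim=1)   # (B*C, n_future, dim)
        return preds.reshape(B, C, -1, D).permute(0, 2, 1, 3)  # (B, n_future, cols, dim)
```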
Encoder
The encoder architecture is given in the table below:
A contrastive loss is used, which induces the latent space to capture information that is maximally useful for predicting future samples.
With the contrastive loss, the model also stays tractable, thanks to negative sampling.
To train this model effectively, a loss function is needed that scores positive pairs (correct patch predictions) higher than negative pairs (incorrect patches). To calculate the loss, a set X of N patches is used, consisting of N-1 negative samples and 1 positive sample (the correct patch). The N-1 negatives are sampled randomly from the available patches of the same image (excluding the correct patch) and from the other images in the batch. This loss is termed the InfoNCE loss, where NCE stands for Noise-Contrastive Estimation; it is shown below.
InfoNCE loss
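Reconstructed here from the description below (q is the prediction, k+ the positive, k-_i the negatives), the loss reads:

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\log \frac{\exp(q \cdot k^{+})}
               {\exp(q \cdot k^{+}) + \sum_{i=1}^{N-1} \exp(q \cdot k^{-}_{i})}
```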
Here q is the network's prediction, k+ is the positive (correct) patch, and k- represents the set of N-1 negative patches. Note that k+, k-, and q all live in representation space, i.e. they are outputs of g_enc, not points in the original image space.
In simple terms, the formula is equivalent to a log-softmax. The dot product is used as the similarity measure: take the dot product of the prediction q with all N samples, then take the log-softmax and read off the score of the positive sample.
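In code, this is a dot product followed by a log-softmax (a sketch; `keys[0]` is assumed to hold the positive sample):

```python
import torch
import torch.nn.functional as F

def info_nce(q, keys):
    """q: (dim,) prediction; keys: (N, dim), with keys[0] the positive."""
    logits = keys @ q                        # dot-product similarity with all N samples
    return -F.log_softmax(logits, dim=0)[0]  # -log softmax score of the positive
```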
To validate the richness of the representations learnt by CPC, a linear evaluation protocol is used: a linear classifier is trained on top of the output of the frozen encoder model (g_enc) using the ImageNet dataset, and the classifier is then evaluated for classification accuracy on the ImageNet val/test set. Note that during this whole training process, the backbone model (g_enc) is kept fixed and is not trained at all. The table below shows that CPC representations outperformed all the other methods introduced before CPC, with 48.7% top-1 accuracy.
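A sketch of this linear probing loop (assuming a frozen g_enc that maps images to 1024-d features and a standard ImageNet data loader; all names are placeholders):

```python
import torch
import torch.nn as nn

def linear_probe(g_enc, loader, feat_dim=1024, n_classes=1000, epochs=10):
    """Train a linear classifier on frozen g_enc features."""
    for p in g_enc.parameters():
        p.requires_grad = False              # the backbone is never updated
    g_enc.eval()
    clf = nn.Linear(feat_dim, n_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = g_enc(x)             # frozen features
            loss = nn.functional.cross_entropy(clf(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```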
Although CPC outperformed the other unsupervised representation-learning methods, its classification accuracy was still far from the supervised counterpart (a ResNet-50 trained with 100% of the labels on ImageNet reaches 76.5% top-1 accuracy). This idea of image-crop discrimination was later extended to instance discrimination, which tightened the gap between self-supervised and supervised learning methods.
Recent work (Hénaff et al., 2019) has scaled up CPC and achieved 71.5% top-1 accuracy when evaluated with linear classification on ImageNet.