[MoCo] Momentum Contrast for Unsupervised Visual Representation Learning
{Scaling the number of negative examples (MoCo), Momentum Encoder}
Figure: Comparison of different strategies for using negative samples in contrastive methods. Here x_q are the positive examples and x_k are the negative examples. Note that the gradient does not flow back through the momentum encoder in MoCo.
MoCo does the same thing as AMDIM (using the last feature map only), but keeps a running history of the mini-batches it has seen and uses them as additional negative samples. The effect is that the number of negatives providing the contrastive signal grows well beyond a single batch.
InfoNCE
Momentum Contrast (MoCo), on the other hand, keeps a separate buffer of negatives (as large as 65,536 in the paper) and uses them when computing the InfoNCE loss. This allows MoCo to be trained with smaller batch sizes without compromising accuracy.
MoCo keeps the most recent mini-batches in a fixed-size queue of negatives. To make this work well, a momentum encoder (θ_k) is used, which has the same architecture as the query encoder (θ_q) but whose weights slowly track those of the query encoder.
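As a rough sketch (not the authors' implementation), the queue can be stored as a fixed-size tensor of key features in which the newest mini-batch overwrites the oldest entries; `feature_dim`, `queue_size`, and `dequeue_and_enqueue` below are illustrative names and values:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration.
feature_dim, queue_size = 128, 65536

# The queue stores L2-normalized key features, one column per negative.
queue = F.normalize(torch.randn(feature_dim, queue_size), dim=0)
queue_ptr = 0  # points at the oldest entries


@torch.no_grad()
def dequeue_and_enqueue(keys: torch.Tensor) -> None:
    """Overwrite the oldest keys in the queue with the newest mini-batch of keys."""
    global queue_ptr
    batch_size = keys.shape[0]
    assert queue_size % batch_size == 0  # simplifying assumption
    queue[:, queue_ptr:queue_ptr + batch_size] = keys.T
    queue_ptr = (queue_ptr + batch_size) % queue_size
```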
Contrastive methods tend to work better with a larger number of negative examples, presumably because more negatives cover the underlying distribution more effectively and thus provide a better training signal. In the usual formulation of contrastive learning, the gradients flow back through the encoders of both the positive and negative samples, which restricts the number of negative samples to the size of the mini-batch.
Momentum Contrast (MoCo, He et al., 2019) gets around this by maintaining a large queue of negative samples and not using backpropagation to update the negative (key) encoder. Instead, it updates the key encoder after every training step using a momentum update:

θ_k ← m·θ_k + (1 − m)·θ_q

where m ∈ [0, 1) is the momentum coefficient (0.999 in the paper).
Here, θ_k denotes the weights of the encoder for negative examples, and θ_q denotes the weights of the encoder for positive examples.
The only role of the momentum encoder is to generate representations (k_0, k_1, k_2, …) from the negative samples. Note that the momentum encoder does not update its weights through backpropagation, which makes the method more memory efficient and makes it possible to keep a large buffer of negatives in memory.
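A minimal sketch of this update, assuming two ResNet-50 encoders with identical architectures (`encoder_q`, `encoder_k`, and `momentum_update` are illustrative names; m = 0.999 is the value reported in the paper):

```python
import copy
import torch
import torchvision

# Query encoder (theta_q) is trained by backprop; the momentum/key encoder (theta_k)
# is an identical copy whose weights only move via the EMA update below.
encoder_q = torchvision.models.resnet50(num_classes=128)
encoder_k = copy.deepcopy(encoder_q)
for p in encoder_k.parameters():
    p.requires_grad = False  # gradients never flow into the momentum encoder

m = 0.999  # momentum coefficient from the paper


@torch.no_grad()
def momentum_update() -> None:
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_k, p_q in zip(encoder_k.parameters(), encoder_q.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```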
In short: given an input x_query, the query encoder generates the representation q, which is matched against the key computed from another augmented view of the same image (not shown in the figure) and against the negatives provided by the momentum encoder's queue. The loss is then calculated with the InfoNCE loss described in the CPC section.
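A hedged sketch of this loss computation, in the spirit of the paper's pseudocode: `q` are query features, `k` the positive keys from the momentum encoder, and `queue` the buffer of negatives from the sketch above; `moco_infonce_loss` is an illustrative name, and temperature = 0.07 is the value used in MoCo v1.

```python
import torch
import torch.nn.functional as F


def moco_infonce_loss(q: torch.Tensor, k: torch.Tensor, queue: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over one positive key per query and all queued negatives."""
    q = F.normalize(q, dim=1)                # (N, C) query features
    k = F.normalize(k, dim=1)                # (N, C) positive keys from the momentum encoder
    l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)   # (N, 1) positive logits
    l_neg = torch.einsum('nc,ck->nk', q, queue)            # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```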
In the second version of MoCo (MoCo v2), the learned representations reached 71.1% top-1 accuracy on ImageNet under the linear evaluation protocol, moving closer to the supervised ResNet-50 baseline (76.5%).