[AMDIM] Learning Representations by Maximizing Mutual Information Across Views
{Comparing Representation across Views, Learning Invariances}
Paper: https://www.microsoft.com/en-us/research/uploads/prod/2019/07/AMDIM-NeurIPS-updated.pdf
Figure. Different Meanings of Views (multiple viewing positions, different modalities, or even samples produced by different types of augmentation)
Left: AMDIM learns representations that are invariant across data augmentations such as random-crop.
Right: CMC learns representations that are invariant across different channels of an image.
Propose an approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context.
Multiple views of a local spatiotemporal context can be produced by observing it from different locations (e.g., camera positions within a scene), and via different modalities (e.g., tactile, auditory, or visual).
An ImageNet image could provide a context from which one produces multiple views by repeatedly applying data augmentation.
Maximizing mutual information between features extracted from these views requires capturing information about high-level factors whose influence spans multiple views – (e.g., the presence of certain objects or the occurrence of certain events).
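In its simplest form (the notation here is generic, not the paper's), the goal is to learn an encoder f whose features for two views x^1 and x^2 of the same context retain as much shared information as possible:

$$\max_{f}\; I\big(f(x^1);\, f(x^2)\big)$$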
Develop a model that learns image representations that significantly outperform prior methods on the tasks we consider.
AMDIM is built on the basis of local Deep InfoMax with 3 extensions:
Features are predicted across independently augmented forms of each input.
Features are simultaneously predicted across multiple scales.
A more powerful encoder is used.
Self-supervised learning for computer vision.
Doersch et al. [2015] and Noroozi and Favaro [2016] learn representations by learning to predict/reconstruct the spatial structure.
Zhang et al. [2016] introduce the task of predicting color information that has been removed by converting images to grayscale.
Gidaris et al. [2018] propose learning representations by predicting the rotation of an image relative to a fixed reference frame, which works surprisingly well.
Self-supervised learning by maximizing mutual information between features extracted from multiple views of a shared context.
E.g., maximizing mutual information between features extracted from a video with most color information removed and features extracted from the original full-color video.
Vondrick et al. [2018]: Object tracking can emerge as a side-effect of optimizing this objective in the special case where the features extracted from the full-color video are simply the original video frames.
Consider predicting how a scene would look when viewed from a particular location, given an encoding computed from several views of the scene from other locations.
Eslami et al. [2018]: requires maximizing mutual information between features from the multi-view encoder and the content of the held-out view.
The general goal is to distill information from the available observations such that contextually-related observations can be identified among a set of plausible alternatives.
Arandjelović and Zisserman [2017, 2018]: consider learning representations by predicting cross-modal (audio-visual) correspondence.
While the mutual information bounds in [Vondrick et al., 2018, Eslami et al., 2018] rely on explicit density estimation, our model uses the contrastive bound from CPC [van den Oord et al., 2018], which has been further analyzed by McAllester and Stratos [2018], and Poole et al. [2019].
The key idea is that maximizing mutual information between features extracted from multiple views of a shared context forces the features to capture information about higher-level factors (e.g., presence of certain objects or occurrence of certain events) that broadly affect the shared context.
Uses standard data augmentation techniques as the set of transformations a representation should be invariant to.
Comparing representations across views from feature maps extracted from intermediate layers of a convolutional neural network (CNN).
1) Multiple views of an image:
Figure. Generate Multiple Views by Applying Augmentation 2 times.
This idea was proposed as early as 2014 by Dosovitskiy et al., who use a “seed” image to generate many augmented versions of the same image.
2) Intermediate layers of a CNN:
Use multi-scale receptive fields to make comparisons across spatial scales.
Random resized crop.
Random jitter in color space.
Random grayscale transformation.
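A minimal sketch of a two-view augmentation pipeline built from the transforms listed above, assuming PyTorch/torchvision; the crop size, jitter strengths, and the TwoViews wrapper are illustrative, not the authors' exact policy:

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline: random resized crop, color jitter,
# random grayscale (plus a flip); the parameter values are assumptions.
augment = T.Compose([
    T.RandomResizedCrop(128),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.25),
    T.ToTensor(),
])

class TwoViews:
    """Apply the same stochastic pipeline twice to get two views of one image."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, img):
        return self.transform(img), self.transform(img)

two_view_augment = TwoViews(augment)  # x1, x2 = two_view_augment(pil_image)
```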
A standard ResNet with some modifications so that it is suited to DIM.
The main issue is the receptive fields.
The receptive fields are controlled so that the two features in a positive sample pair do not overlap by a large amount.
In other words, the two features should not be extracted from nearly identical regions of the input image, because that would make maximizing the mutual information between them trivially easy.
Padding is avoided to keep the feature distributions stable.
Figure. Encoder used in AMDIM and CPC
The score Φ for comparing a feature pair can be a simple similarity, such as a dot product or a cosine similarity.
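A tiny sketch of these two scoring options (illustrative only; the temperature parameter is an assumption, not part of the paper's Φ):

```python
import torch.nn.functional as F

def dot_score(a, c):
    # Plain dot product between antecedent and consequent embeddings.
    return (a * c).sum(dim=-1)

def cosine_score(a, c, temperature=0.1):
    # Cosine similarity, sharpened with a temperature.
    return F.cosine_similarity(a, c, dim=-1) / temperature
```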
The authors of AMDIM maximize the NCE lower bound by minimizing the loss below:
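(Reconstructed from the paper's formulation; f_1(x) denotes the global feature and f_7(x)_{ij} the local feature at position (i, j) of the 7×7 feature map.)

$$\mathbb{E}_{\left(f_1(x),\, f_7(x)_{ij}\right)}\Big[\, \mathbb{E}_{N_7}\big[\mathcal{L}_{\Phi}\big(f_1(x),\, f_7(x)_{ij},\, N_7\big)\big]\Big]$$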
where N_7 is the set of negative samples and L_Φ is a standard log-softmax (NCE) loss:
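(Reconstructed from the paper:)

$$\mathcal{L}_{\Phi}\big(f_1,\, f_7,\, N_7\big) \;=\; -\log \frac{\exp\!\big(\Phi(f_1, f_7)\big)}{\sum_{\tilde{f}_7 \,\in\, N_7 \cup \{f_7\}} \exp\!\big(\Phi(f_1, \tilde{f}_7)\big)}$$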
f and Φ are two parametric functions to be learned; both are deep neural networks, and their parameters are the network weights.
Concretely, Φ maps a pair of (antecedent features, consequent features) to a single scalar score. The higher the score, the more likely the pair is positive, i.e., the antecedent and consequent were extracted from the same sample.
The numerator is the exponential of the score of one (antecedent, consequent) pair (which can be a positive or a negative pair).
In the denominator, f̃_7 ranges over the union of the negative samples and the positive sample, and the sum of the exponentials of their scores is taken.
The goal is for this fraction to be high for positive pairs and low for negative pairs; the better this holds, the lower the loss.
Adapting this to augmented features (features of augmented samples), the loss becomes:
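(Reconstructed; the superscripts index the two augmented views.)

$$\mathbb{E}_{\left(f_1(x^1),\, f_7(x^2)_{ij}\right)}\Big[\, \mathbb{E}_{N_7}\big[\mathcal{L}_{\Phi}\big(f_1(x^1),\, f_7(x^2)_{ij},\, N_7\big)\big]\Big]$$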
x^1 and x^2 are two independently augmented versions of the same original input x.
We can also infer from this loss that a pair need not be built from the global and local features of the same view; its two elements can come from two views produced by different augmentations.
Figure. Local DIM with predictions across views generated by data augmentation.
Multiscale means multi-level prediction, in which the two elements of a feature pair can be extracted from different layers (scales) of the network.
Features at multiple levels of the forward pass on one view can share mutual information with features at multiple levels of the forward pass on the other view.
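Schematically (a sketch, not the paper's exact equation), the multiscale objective sums the across-view NCE losses over a set S of (antecedent scale, consequent scale) pairs, applied in both view directions; in the paper these include predictions such as global-to-7×7, global-to-5×5, and 5×5-to-5×5:

$$\sum_{(n,m)\in S} \mathbb{E}_{\left(f_n(x^1),\, f_m(x^2)_{ij}\right)}\Big[\, \mathbb{E}_{N_m}\big[\mathcal{L}_{\Phi}\big(f_n(x^1),\, f_m(x^2)_{ij},\, N_m\big)\big]\Big] \;+\; \big(x^1 \leftrightarrow x^2\big)$$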
Figure. Augmented Multiscale DIM, with multiscale infomax across views generated by data augmentation
An algorithm for efficient NCE with minibatches of n_a images, each providing one antecedent and n_c consequents.
For each true (antecedent, consequent) positive sample pair, we compute the NCE bound using all consequents associated with all other antecedents as negative samples.
Dynamic programming is utilized in the log-softmax normalizations required by the NCE loss.
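A minimal sketch of this batched NCE computation, assuming PyTorch; the function name and the temperature parameter are illustrative additions, not the authors' code. Each antecedent is scored against every consequent in the minibatch, and the consequents of all other images act as negatives:

```python
import torch

def batched_nce_loss(antecedents, consequents, temperature=1.0):
    """Sketch of the batched NCE loss described above.

    antecedents: (na, d)      one antecedent feature per image
    consequents: (na, nc, d)  nc consequent features per image
    For each true (antecedent, consequent) pair, the consequents belonging to
    every *other* antecedent in the minibatch serve as negative samples.
    """
    na, nc, d = consequents.shape

    # Scores between every antecedent i and every consequent (j, k): (na, na, nc).
    scores = torch.einsum('id,jkd->ijk', antecedents, consequents) / temperature

    # Positive scores: antecedent i paired with its own consequents -> (na, nc).
    idx = torch.arange(na)
    pos = scores[idx, idx, :]

    # Mask out each antecedent's own consequents, then log-sum-exp over the
    # remaining (other-image) consequents -> one negative normalizer per row.
    self_mask = torch.eye(na, dtype=torch.bool, device=scores.device).unsqueeze(-1)
    neg = scores.masked_fill(self_mask, float('-inf'))
    neg_lse = torch.logsumexp(neg.reshape(na, -1), dim=1)      # (na,)

    # NCE loss: -log softmax of each positive against {positive} ∪ {negatives}.
    denom = torch.logaddexp(pos, neg_lse.unsqueeze(1))         # (na, nc)
    return (denom - pos).mean()
```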
CIFAR10.
CIFAR100.
STL10.
ImageNet.
Places205.
Table 1 (a): Comparing AMDIM with prior results for the ImageNet and ImageNet→Places205 transfer tasks using linear evaluation.
The (sup) models were fully-supervised, with no self-supervised costs.
The small and large AMDIM models had size parameters: (ndf=192, nrkhs=1536, ndepth=8) and (ndf=320, nrkhs=2560, ndepth=10).
AMDIM outperforms prior and concurrent methods by a large margin.
Table 2 (b): Comparing AMDIM with fully-supervised models on CIFAR10 and CIFAR100, using linear and MLP evaluation.
The small and large AMDIM models had size parameters: (ndf=128, nrkhs=1024, ndepth=10) and (ndf=256, nrkhs=2048, ndepth=10).
AMDIM features performed on par with classic fully-supervised models.
Table 3 (c): Results of single ablations on STL10 and ImageNet.
The size parameters for all models on both datasets were: (ndf=192, nrkhs=1536, ndepth=8).
Our strongest results used the Fast AutoAugment augmentation policy from Lim et al. [2019], and we report the effects of switching from basic augmentation to stronger augmentation as “+strong aug”.
Data augmentation had the strongest effect by a large margin, followed by stability regularization and multiscale prediction.
SimCLR, MoCo, BYOL, and SwAV can be viewed as variants of AMDIM.
The choice of the encoder does not matter as long as it is wide.
The representation extraction strategy does not matter as long as the data augmentation pipeline generates good positive and negative inputs.
Compare CPC and DIM:
Similarities:
Both maximize the mutual information between global and local features.
They share some motivations and computations.
Differences:
In CPC, local features are processed iteratively and cumulatively to build “summary features”; from each summary feature, specific local features further ahead in the sequence are predicted. This is similar to ordered autoregression.
Basic (local) DIM instead uses a single summary feature computed from all local features, and this one feature simultaneously predicts all local features using a single estimator.
Compare DIM and AMDIM:
Similarities:
AMDIM is built on the basis of local DIM.
Differences:
The idea of AMDIM is to maximize mutual information between features extracted from multiple views of the input, rather than between the global and local features of a single view of the input.
AMDIM has the procedure of multiscale mutual information.
AMDIM has a stronger encoding network.
Compare CPC and AMDIM
Similarities:
AMDIM uses the same contrastive bound as CPC.
Differences:
AMDIM generates two views of an image by applying the standard stochastic transforms (jitter, flip, etc.) to the same image twice.
AMDIM is built on DIM, so its differences from CPC can also be inferred from the two comparisons above.