[DIM] Deep InfoMax
{Local-Global Structure, Mutual Information Maximization}
Paper: https://arxiv.org/abs/1808.06670
Code:
Figure 1. The contrastive task in Deep InfoMax.
Motivation:
Objectives:
Related Works:
Deep InfoMax learns representations of images by leveraging the local structure present in an image.
The contrastive task behind DIM is to classify whether a pair of global features and local features are from the same image or not.
Global features are the final output of a convolutional encoder (a flat vector, Y).
Local features are the output of an intermediate layer in the encoder (an M x M feature map). Each local feature map has a limited receptive field.
To solve this contrastive task, the global feature vector must capture information from all the different local regions of the image.
DIM estimates and maximizes the mutual information between the input data and its high-level representation simultaneously.
Notably, DIM can choose whether local or global information is prioritized to adapt the learned representations for classification or reconstruction tasks.
Adversarial learning is also used to match the statistical characteristics of the learned representation to those of a chosen prior.
An MxM feature map is first obtained by feeding the input image through a convolutional network. Then, all vectors along the depth dimension of the MxM feature map are summarized into a single high-level feature vector.
The goal is for the trained encoder to produce high-level feature vectors that still contain useful information about the input.
Figure 2. The base encoder model in the context of image data.
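The two feature levels described above can be sketched in a few lines of numpy. This is only a shape-level illustration, not the paper's architecture: the function name, dimensions, and the mean-pool-plus-projection summarizer are all hypothetical stand-ins for the real convolutional encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, M=8, local_dim=64, global_dim=128):
    """Toy stand-in for the convolutional encoder (hypothetical shapes).

    Returns an M x M map of local feature vectors (each with a limited
    receptive field) and a single global feature vector Y summarizing them.
    """
    # Stand-in for the intermediate conv layer: an M x M x local_dim map.
    local_map = rng.standard_normal((M, M, local_dim))
    # Stand-in for the remaining layers: summarize all local vectors
    # into one flat global vector Y (here via mean-pool + projection).
    pooled = local_map.mean(axis=(0, 1))              # shape: (local_dim,)
    W = rng.standard_normal((global_dim, local_dim))  # toy projection
    Y = W @ pooled                                    # shape: (global_dim,)
    return local_map, Y

local_map, Y = encode(image=None)
print(local_map.shape, Y.shape)  # (8, 8, 64) (128,)
```

The key point is only the interface: one MxM grid of local vectors and one flat global vector Y per image.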
InfoNCE Loss.
Find the set of parameters/weights so that I(X, E(X)) is maximized. The input X can be a complete image or just a local patch.
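A minimal numpy sketch of the InfoNCE objective for a single anchor, assuming a critic f that scores candidate pairs (the function name and score values below are illustrative, not from the paper):

```python
import numpy as np

def info_nce(scores, pos_index=0):
    """InfoNCE loss for one anchor.

    scores[i] is the critic score f(x, y_i) for candidate i; pos_index
    marks the positive (same-image) pair. Minimizing this loss maximizes
    a lower bound on the mutual information I(X; E(X)).
    """
    # Numerically stable log-sum-exp over all candidates.
    m = scores.max()
    logsumexp = np.log(np.exp(scores - m).sum()) + m
    # Negative log-softmax evaluated at the positive entry.
    return -(scores[pos_index] - logsumexp)

scores = np.array([5.0, 1.0, 0.5, -2.0])  # positive first, 3 negatives
loss = info_nce(scores)
```

Raising the positive score relative to the negatives drives the loss toward zero, which is exactly what training the encoder and critic jointly achieves.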
Figure 3. Basic mutual information maximization framework: Deep InfoMax (DIM) with a global MI(X;Y) objective.
In the first branch, the MxM feature map and the feature vector Y, which come from the same image, are fed through a discriminator to produce a score for “real”.
Similarly, if we replace the original MxM feature map with the MxM feature map of another image and feed it together with the same feature vector Y into the discriminator, we obtain a score for “fake”.
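The two branches can be sketched as follows, using an untrained bilinear critic as the discriminator. All names and dimensions here are hypothetical; the paper's actual discriminators are learned networks, and with random weights the scores below are not yet informative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, local_dim, global_dim = 4, 16, 32

# Toy discriminator: a bilinear critic on (flattened local map, Y).
W = rng.standard_normal((M * M * local_dim, global_dim)) * 0.01

def score(local_map, Y):
    """Scalar score for one (MxM feature map, global vector) pair."""
    return float(local_map.reshape(-1) @ W @ Y)

# Features of image A and of a different image B.
map_A = rng.standard_normal((M, M, local_dim))
Y_A = rng.standard_normal(global_dim)
map_B = rng.standard_normal((M, M, local_dim))

real_score = score(map_A, Y_A)  # map and Y from the same image -> "real"
fake_score = score(map_B, Y_A)  # map swapped for another image -> "fake"
```

Training pushes real_score up and fake_score down, so the critic (and hence the encoder) learns to tell matched pairs from mismatched ones.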
Figure 4. Their local DIM framework. Maximizing mutual information between local features and global features.
It is the same as Figure 3, except that the lower-level local feature vectors at each spatial position of the MxM feature map are combined with the global feature vector Y in turn.
The result is an MxM matrix of which each element is a score of a local-global pair.
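Computing that MxM score matrix reduces to one pairing per spatial position. A minimal sketch, assuming matching local and global dimensions so a plain dot product can serve as the critic (the paper uses learned critics instead):

```python
import numpy as np

rng = np.random.default_rng(1)
M, dim = 4, 16

local_map = rng.standard_normal((M, M, dim))  # M x M local feature vectors
Y = rng.standard_normal(dim)                  # global feature vector

# One score per local-global pair: contract the depth dimension of
# the feature map against Y, leaving an M x M matrix of scores.
score_map = np.einsum('ijd,d->ij', local_map, Y)

print(score_map.shape)  # (4, 4)
```

Each entry of score_map is the score for one local-global pair, matching the MxM result described above.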
Dataset:
Metrics:
Experimental Results:
Ablations:
DIM has been extended to other domains such as graphs (Veličković et al., 2018) and RL environments (Anand et al., 2019). A follow-up to DIM, Augmented Multiscale DIM (Bachman et al., 2019), achieves 68.4% Top-1 accuracy on ImageNet with unsupervised training when evaluated with the linear classification protocol.
Loss Function
L(θ) = -E[ log( exp(f_θ(x, y⁺)) / Σ_n exp(f_θ(x, y_n)) ) ], where y⁺ is the positive (same-image) pair and the sum runs over one positive and N-1 negative candidates y_n.