[CMC] Contrastive Multiview Coding
Humans view the world through many sensory channels.
Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt).
We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors.
The framework of multiview contrastive learning: learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact.
Our approach scales to any number of views and is view-agnostic.
The more views the model learns from, the better the resulting representation captures underlying scene semantics.
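In practice, this mutual-information objective is approximated with a contrastive (NCE-style) loss: embeddings of two views of the same scene are scored as positives against embeddings of views from other scenes. Below is a minimal PyTorch sketch of such a two-view loss with in-batch negatives; the function name, temperature value, and batch-negative scheme are illustrative assumptions, not the paper's exact NCE implementation.

```python
import torch
import torch.nn.functional as F

def two_view_infonce(z1, z2, temperature=0.07):
    """Contrastive loss between two views of the same batch of scenes.

    z1, z2: (batch, dim) embeddings from two view-specific encoders;
    row i of z1 and row i of z2 come from the same scene (positive pair),
    all other rows in the batch act as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize: treat each view in turn as the anchor.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Example usage with random embeddings standing in for encoder outputs.
z_l, z_ab = torch.randn(8, 128), torch.randn(8, 128)
print(two_view_infonce(z_l, z_ab))
```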
Uses different views of the same image (luminance, chrominance, depth, surface normals, and semantic labels) as the set of transformations the representation should be invariant to.
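As a concrete example of how such views are constructed, an RGB image can be split into a luminance view and a chrominance view via Lab color space. A minimal sketch using scikit-image (the helper name lab_views is ours, for illustration):

```python
import numpy as np
from skimage import color

def lab_views(rgb_image):
    """Split an RGB image of shape (H, W, 3), values in [0, 1], into two views:
    the luminance channel (L) and the chrominance channels (ab)."""
    lab = color.rgb2lab(rgb_image)
    luminance = lab[..., :1]     # L channel,     shape (H, W, 1)
    chrominance = lab[..., 1:]   # a, b channels, shape (H, W, 2)
    return luminance, chrominance

# Example usage on a random image.
rgb = np.random.rand(224, 224, 3)
view_l, view_ab = lab_views(rgb)
```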
Two views are input as a pair, and each view passes through its own encoder.
Each encoder maps its input to a latent vector z in a shared feature space.
If the two views come from the same source (or share the same label), they form a positive pair and are mapped close together; if they come from different sources or labels, they form a negative pair and are mapped far apart.
This process clusters the compressed representations of similar inputs, capturing the overall patterns and characteristics of each image, hence it is termed the "image-level approach".
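To make this image-level setup concrete, the sketch below wires up one toy encoder per view and inspects the similarity matrix in the shared feature space: diagonal entries are same-scene (positive) pairs that should score high, off-diagonal entries are negatives that should score low. The ViewEncoder architecture and dimensions are illustrative assumptions; CMC uses deep CNN encoders per view.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewEncoder(nn.Module):
    """Toy per-view encoder mapping a flattened view to a latent vector z.
    A real implementation would use a deep CNN per view."""
    def __init__(self, in_dim, z_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, z_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)   # unit-norm z in the shared space

enc_a, enc_b = ViewEncoder(1024), ViewEncoder(2048)      # one encoder per view
view_a, view_b = torch.randn(8, 1024), torch.randn(8, 2048)
z_a, z_b = enc_a(view_a), enc_b(view_b)

sim = z_a @ z_b.t()                  # (8, 8) cosine-similarity matrix
positives = sim.diag()               # same-scene pairs: pulled close (high similarity)
mask = ~torch.eye(8, dtype=torch.bool)
negatives = sim[mask]                # different-scene pairs: pushed apart (low similarity)
```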