0) Motivation, Objective and Related Works:
Motivation:
The aim of the pretext task is to learn, through self-supervision, the spatial relationship between different patches of an image.
Objectives:
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation.
Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images.
For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework [21] and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
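To make the pairwise setup concrete, below is a simplified sketch of such a network in PyTorch. The paper uses AlexNet-style towers with tied weights and late fusion; the much smaller encoder, the layer sizes, and the name PatchPairNet here are illustrative assumptions, not the paper's exact architecture.

```python
# A simplified sketch of the pairwise setup: two patches pass through a
# shared (weight-tied) convolutional encoder, the two embeddings are fused,
# and an 8-way head predicts the second patch's relative position.
# The encoder below is deliberately small; all layer sizes are assumptions.
import torch
import torch.nn as nn

class PatchPairNet(nn.Module):
    def __init__(self, num_positions=8):
        super().__init__()
        # Shared encoder applied to both patches (weights are tied).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Late fusion: concatenate the two embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * 128, 256), nn.ReLU(),
            nn.Linear(256, num_positions),
        )

    def forward(self, centre, neighbour):
        z = torch.cat([self.encoder(centre), self.encoder(neighbour)], dim=1)
        return self.classifier(z)  # logits over the 8 relative positions
```

Training then reduces to standard cross-entropy between these logits and the 0–7 relative-position label.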
Training Algorithm:
Sample a random patch from the image.
Neighbour sampling: assuming the first patch sits at the centre of a 3×3 grid, sample the second patch from one of its 8 neighbouring locations.
Introduce augmentations such as gaps between the patches, colour manipulations to counter chromatic aberration, and downsampling followed by upsampling of patches to handle pixelation, along with colour jitter. These keep the model from overfitting to particular low-level signals.
The task is to identify which of the 8 neighbouring positions the second patch was drawn from, framed as an 8-way classification problem (see the sampling sketch below).
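A minimal sketch of this sampling procedure in NumPy. The patch size, gap, and jitter values are illustrative assumptions rather than the paper's exact hyperparameters, and sample_patch_pair is a hypothetical helper, not code from the paper.

```python
# A minimal sketch of patch-pair sampling: place the first patch at the
# centre of an implicit 3x3 grid, pick one of its 8 neighbours, and return
# the pair plus the neighbour's index as the classification label.
import numpy as np

PATCH = 96   # assumed patch side length, in pixels
GAP = 48     # assumed gap between patches (about half a patch)
JITTER = 7   # assumed maximum random shift of the neighbour's position

# Offsets of the 8 neighbouring grid cells (row, col); the list index is
# the class label the network must predict.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     ( 0, -1),          ( 0, 1),
                     ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_patch_pair(image, rng):
    """Return (centre_patch, neighbour_patch, label) from an H x W x 3 array."""
    h, w = image.shape[:2]
    step = PATCH + GAP  # grid spacing between patch corners
    # Choose the centre patch so the whole 3x3 grid (plus jitter) fits;
    # assumes the image is large enough for the full grid.
    y0 = rng.integers(step + JITTER, h - step - PATCH - JITTER + 1)
    x0 = rng.integers(step + JITTER, w - step - PATCH - JITTER + 1)
    label = rng.integers(8)
    dy, dx = NEIGHBOUR_OFFSETS[label]
    # Jitter the neighbour so the exact offset is never a deterministic cue.
    jy, jx = rng.integers(-JITTER, JITTER + 1, size=2)
    ny, nx = y0 + dy * step + jy, x0 + dx * step + jx
    centre = image[y0:y0 + PATCH, x0:x0 + PATCH]
    neighbour = image[ny:ny + PATCH, nx:nx + PATCH]
    return centre, neighbour, label
```

Each sampled triple feeds the pairwise network, e.g. `sample_patch_pair(img, np.random.default_rng(0))`, with the 0–7 label as the classification target.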
While designing the pretext task, it is important to ensure the network cannot solve it through trivial patterns rather than the high-level latent features underlying global structure. Low-level cues such as boundary textures that continue across patches are one example of such trivial features. For some images, however, a subtler trivial solution exists. It arises from a camera-lens effect called chromatic aberration, which occurs because light of different wavelengths is focused differently by the lens.
Because common lenses shift the green channel relative to the magenta (red + blue) channels depending on distance from the image centre, a convolutional network can infer the absolute location of a patch within the image, and hence the relative location of two patches, without learning anything semantic. Nearest-neighbour experiments showed this happening: some patches retrieved regions from exactly the same absolute image location because those patches displayed similar aberration.
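One way to blunt this shortcut, consistent with the colour augmentations listed in the training steps above, is to stop the network from comparing colour channels at all, for example by keeping only one channel or shifting patches towards grayscale. The sketch below is a hypothetical implementation; decolourise and its parameter values are assumptions, not the paper's exact recipe.

```python
# A minimal sketch of colour augmentation against the chromatic-aberration
# shortcut: either keep a single colour channel (replacing the others with
# noise) or blend the patch towards grayscale. All parameter values are
# illustrative assumptions.
import numpy as np

def decolourise(patch, rng, p_channel_drop=0.5):
    """Weaken cross-channel offset cues in a float H x W x 3 patch in [0, 1]."""
    patch = patch.copy()
    if rng.random() < p_channel_drop:
        # Keep one random channel and fill the other two with noise, so the
        # network cannot compare green against magenta (red + blue).
        keep = rng.integers(3)
        for c in range(3):
            if c != keep:
                patch[..., c] = rng.normal(0.5, 0.1, size=patch.shape[:2])
    else:
        # Partially project towards grayscale, shrinking per-channel offsets.
        gray = patch.mean(axis=-1, keepdims=True)
        alpha = rng.uniform(0.5, 1.0)  # assumed blending strength
        patch = alpha * gray + (1.0 - alpha) * patch
    return np.clip(patch, 0.0, 1.0)
```

Applying such an augmentation to every sampled patch removes the colour-offset signal while leaving shapes and textures intact, forcing the network back onto semantic cues.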