The main idea is to design simpler goals/targets so that labeled data does not need to be generated. The most crucial and challenging point is that the task must sit at an appropriate difficulty level for the model to learn.
NLP: Both BERT and ALBERT pretrained the model to judge whether the second of two text fragments truly follows the first.
BERT created negative training samples by replacing the next fragment with a fragment from a randomly chosen other document (next-sentence prediction; NSP).
ALBERT created negative training samples by swapping the order of the previous and next fragments (sentence-order prediction; SOP).
SOP has been shown to outperform NSP (ref).
Random sentence pairs are so easy to tell apart by topic alone that the model learned little from the NSP task; SOP instead forces the model to learn the coherence relationship between sentences. As a result, designing a good task requires domain knowledge, plus experiments to validate how effective the task is.
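To make the contrast concrete, here is a minimal sketch in plain Python of how the two kinds of training pairs could be constructed (the document/segment representation is an assumption for illustration, not how either paper's pipeline is implemented):

```python
import random

def make_nsp_pair(doc, docs):
    """NSP: positive = two consecutive segments; negative = the second
    segment drawn from a different, randomly chosen document."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1                       # positive: true next segment
    other = random.choice([d for d in docs if d is not doc])
    return doc[i], random.choice(other), 0                 # negative: segment from elsewhere

def make_sop_pair(doc):
    """SOP: positive = two consecutive segments in order; negative = the
    same two segments with their order swapped (topic stays identical)."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1                       # positive: correct order
    return doc[i + 1], doc[i], 0                           # negative: swapped order
```

Because an SOP negative keeps both segments from the same document, topic cues alone cannot solve it; the model has to pick up on discourse order.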
The idea of predicting context, as in SOP, was also applied in the image field (predicting the relative location of image patches; ref) and in the speech field (predicting the time interval between two groups of acoustic features; ref).
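As an illustration, here is a minimal numpy sketch of the relative-patch-location task; the 3x3 grid and patch size are illustrative assumptions:

```python
import numpy as np

def sample_patch_pair(image, patch=32, rng=np.random.default_rng()):
    """Cut a 3x3 grid of patches; return (center, random neighbor, label),
    where label in {0..7} encodes the neighbor's relative position."""
    h0 = int(rng.integers(0, image.shape[0] - 3 * patch + 1))
    w0 = int(rng.integers(0, image.shape[1] - 3 * patch + 1))
    grid = [image[h0 + r * patch : h0 + (r + 1) * patch,
                  w0 + c * patch : w0 + (c + 1) * patch]
            for r in range(3) for c in range(3)]
    label = int(rng.integers(0, 8))                     # 8 neighbor positions
    neighbor = grid[label if label < 4 else label + 1]  # skip index 4 (the center)
    return grid[4], neighbor, label
```

In practice, gaps and random jitter are added between patches so the model cannot solve the task from low-level continuity cues alone.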
In the image field, DeepCluster applied k-means clustering to produce pseudo-labels (ref).
In the speech field, HuBERT likewise applied k-means clustering (ref), and BEST-RQ employed a random-projection quantizer instead (ref).
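A random-projection quantizer is simple enough to sketch in a few lines; the dimensions below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, code_dim, num_codes = 80, 16, 512   # illustrative sizes

# Both the projection and the codebook are randomly initialized and then
# frozen; neither is ever trained (the paper additionally normalizes the
# vectors before matching).
projection = rng.normal(size=(feat_dim, code_dim))
codebook = rng.normal(size=(num_codes, code_dim))

def quantize(features):
    """Map each frame of shape (feat_dim,) to the index of its nearest
    codebook entry after projection; the indices serve as the targets."""
    z = features @ projection                              # (T, code_dim)
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)                            # (T,) pseudo-labels
```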
Predicting the gray-scale channel from the color channels of an image (and vice versa; ref).
Reconstructing a randomly cropped patch of an image (i.e., inpainting; ref).
Reconstructing the image at its original resolution (ref).
Predicting the rotation angle of an image (ref; see the sketch after this list).
Predicting the colors of images (ref1, ref2, ref3).
Solving a jigsaw puzzle (ref).
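For instance, a rotation-prediction example can be generated with nothing but numpy; this is a minimal sketch of the four-way 90° rotation setup:

```python
import numpy as np

def make_rotation_example(image, rng=np.random.default_rng()):
    """Rotate the image by k * 90 degrees; the pretext label is k in {0,1,2,3}."""
    k = int(rng.integers(0, 4))
    return np.rot90(image, k=k, axes=(0, 1)), k
```

To get the label right, the classifier has to recognize object orientations and canonical poses, which is exactly the kind of semantic cue we want the representation to capture.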
These pretext tasks learn image representations from the pixels themselves, without relying on pre-defined semantic annotations: a transformation is applied to the input image, and the learner must predict properties of the transformation from the transformed image (see Figure 1).
Examples of image transformations used include rotations [20], affine transformations [33, 57, 65, 85], and jigsaw transformations [54].
As the pretext task involves predicting a property of the image transformation, it encourages the construction of image representations that are covariant to the transformations.
Although such covariance is beneficial for tasks such as predicting 3D correspondences [33, 57, 65], it is undesirable for most semantic recognition tasks. Representations ought to be invariant under image transformations to be useful for image recognition [14, 31] because the transformations do not alter visual semantics. In fact, invariance is one of the core tenets of designing ‘good’ features [8, 45, 48].
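The distinction can be made concrete by comparing the two objectives; this is a schematic sketch in which the encoder `f`, the `classifier` head, and `transform` are generic placeholder callables:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def covariant_loss(f, x, transform, t_label, classifier):
    """Pretext-style objective: the representation must retain enough
    information about the transformation to predict its label."""
    logits = classifier(f(transform(x)))
    return -np.log(softmax(logits)[t_label])       # cross-entropy on the label

def invariant_loss(f, x, transform):
    """Invariance-style objective: the representation of the transformed
    image should match that of the original image."""
    z, z_t = f(x), f(transform(x))
    return 1.0 - z @ z_t / (np.linalg.norm(z) * np.linalg.norm(z_t))
```

Minimizing the first loss pushes `f` to encode the transformation; minimizing the second pushes `f` to discard it, which is the behavior semantic recognition tasks prefer.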