Bootstrapping
Researchers further developed bootstrapping approaches to avoid negative examples altogether, since using them is computationally expensive for training and good negative examples are hard to select. The key ideas of bootstrapping approaches are 1) to generate a positive pair of samples from two augmentations of the same original sample (just like contrastive learning); 2) to set up one network as the target network (also called the teacher network) and another as the online network (also called the student network), which has the same architecture as the target network plus an additional feedforward head (called the predictor); 3) to never update the target/teacher network by gradient descent, training only the online/student network; 4) to instead update the weights of the target/teacher network from the weights of the online/student network.
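A minimal PyTorch sketch of this two-network setup is below. The encoder and predictor architectures here are illustrative placeholders (real methods typically use a ResNet or transformer encoder plus an MLP projector), not any particular paper's code:

```python
import copy
import torch
import torch.nn as nn

def make_encoder(embed_dim: int = 256) -> nn.Module:
    # Hypothetical encoder for flattened 28x28 inputs; any backbone works
    # for the purpose of this sketch.
    return nn.Sequential(
        nn.Flatten(), nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, embed_dim)
    )

def make_predictor(embed_dim: int = 256) -> nn.Module:
    # The extra feedforward head that only the online/student network has.
    return nn.Sequential(
        nn.Linear(embed_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
    )

online_encoder = make_encoder()
predictor = make_predictor()

# Target/teacher network: same architecture as the online encoder, but its
# weights are never touched by the optimizer (key ideas 3 and 4).
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False
```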
The most important designs are 1) the online network must have the predictor (the additional head); 2) only the weights of the online network are updated by gradient descent. Without both, the networks collapse (i.e., they output the same values regardless of the input).
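A single training step under this setup might look like the following sketch, reusing the modules from the snippet above. The loss is simplified to negative cosine similarity, which matches BYOL's normalized-L2 objective up to a constant; `augment` is an assumed augmentation callable:

```python
import torch.nn.functional as F

def loss_fn(p, z):
    # Negative cosine similarity between the online prediction p and the
    # target embedding z; the detach makes the stop-gradient explicit.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def train_step(x, augment, optimizer):
    v1, v2 = augment(x), augment(x)        # two views of the same sample
    p1 = predictor(online_encoder(v1))     # online branch has the predictor
    p2 = predictor(online_encoder(v2))
    with torch.no_grad():                  # stop-gradient on the target branch
        z1 = target_encoder(v1)
        z2 = target_encoder(v2)
    loss = (loss_fn(p1, z2) + loss_fn(p2, z1)) / 2  # symmetrized loss
    optimizer.zero_grad()
    loss.backward()                        # only online weights get gradients
    optimizer.step()
    return loss
```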
In the image field, BYOL updates the weights of the target/teacher network by taking an exponential moving average (EMA) of the weights of the online/student network (ref), whereas SimSiam simply copies the weights over, which amounts to sharing a single encoder and applying a stop-gradient to the target branch (ref).
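In code, the two update rules differ by one line. A sketch, where `tau` is the EMA decay (in BYOL it is typically close to 1):

```python
@torch.no_grad()
def update_target_ema(tau: float = 0.996):
    # BYOL-style update: target weights are an exponential moving average
    # of the online weights.
    for pt, po in zip(target_encoder.parameters(), online_encoder.parameters()):
        pt.mul_(tau).add_(po, alpha=1.0 - tau)

@torch.no_grad()
def update_target_copy():
    # SimSiam-style update: copying the online weights every step (tau = 0)
    # reproduces its shared-encoder, stop-gradient setup.
    for pt, po in zip(target_encoder.parameters(), online_encoder.parameters()):
        pt.copy_(po)
```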
Data2vec from Meta is a unified framework for the image, speech, and text fields (ref). It also uses EMA to update the target/teacher network, but its pretext task is masked prediction: the target/teacher network is fed the original data, while the online/student network is fed the masked data. One important design choice is that the objective is to predict, for the masked input regions/tokens, the embedding averaged over the top few layers of the target/teacher network.
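A sketch of how such a regression target could be built, assuming a transformer teacher that returns per-layer hidden states; `hidden_states`, `mask`, and `top_k` are illustrative names, and data2vec's per-layer normalization before averaging is omitted here:

```python
@torch.no_grad()
def teacher_target(hidden_states, mask, top_k: int = 8):
    """Average the teacher's top-k layer embeddings at masked positions.

    hidden_states: list of [batch, seq_len, dim] tensors, one per layer,
                   computed from the *unmasked* input.
    mask:          [batch, seq_len] boolean tensor, True at masked positions.
    """
    # Mean over the top few layers of the teacher.
    target = torch.stack(hidden_states[-top_k:]).mean(dim=0)
    # The student, fed the masked input, is trained to regress these
    # vectors at the masked positions (e.g., with a smooth L1 loss).
    return target[mask]  # [num_masked, dim]
```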