Regularization

This is another approach that needs only positive pairs, without negative examples. Surprisingly, these methods can use identical architectures for the two networks, and they do not need the 'stop gradient' mechanism that updates only one of the networks during training. Instead, extra regularization terms in the objective keep the model from collapsing. The objective function terms include:


Invariance: this loss term keeps the two embeddings from the same positive pair as similar as possible. Barlow Twins's invariance term (in the image domain) and DeLoRes's (in the audio domain) push the diagonal elements of the cross-correlation matrix toward 1, while VICReg (in the image domain) minimizes the mean-squared Euclidean distance between the two embeddings (ref).
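
To make these invariance terms concrete, here is a minimal PyTorch-style sketch, not the papers' reference code; the tensor shapes `(N, D)` for a batch of N embeddings of dimension D and the `eps` constant are assumptions.

```python
import torch
import torch.nn.functional as F

def barlow_twins_invariance(z_a, z_b, eps=1e-5):
    """Push the diagonal of the cross-correlation matrix toward 1."""
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    n = z_a.shape[0]
    c = (z_a.T @ z_b) / n  # (D, D) cross-correlation matrix
    return ((torch.diagonal(c) - 1.0) ** 2).sum()

def vicreg_invariance(z_a, z_b):
    """Mean-squared Euclidean distance between the two views' embeddings."""
    return F.mse_loss(z_a, z_b)
```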

Variance: this regularization term keeps the samples within a batch sufficiently different from one another, since they are not the same sample. Barlow Twins's redundancy reduction term (in the image domain) and DeLoRes's (in the audio domain) push the off-diagonal elements of the cross-correlation matrix toward 0. In the image domain, VICReg's variance term uses a hinge loss to keep the standard deviation of the embeddings, computed across the samples in a batch, above a threshold (ref). VICReg additionally has a covariance term that minimizes the magnitude of the off-diagonal entries of the covariance matrix, decorrelating each pair of embedding dimensions. This term can greatly boost performance and makes full use of all the dimensions of the embedding vectors; however, it is not required for preventing informational collapse (ref).
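
A matching sketch for these terms, again in PyTorch-style Python with assumed shapes `(N, D)`; the threshold `gamma` and `eps` values are illustrative defaults, not necessarily the papers' exact settings.

```python
import torch

def barlow_twins_redundancy(z_a, z_b, eps=1e-5):
    """Push off-diagonal cross-correlation entries toward 0."""
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    c = (z_a.T @ z_b) / z_a.shape[0]
    off_diag = c - torch.diag(torch.diagonal(c))  # zero out the diagonal
    return (off_diag ** 2).sum()

def vicreg_variance(z, gamma=1.0, eps=1e-4):
    """Hinge loss keeping each dimension's batch std above gamma."""
    std = torch.sqrt(z.var(dim=0) + eps)
    return torch.relu(gamma - std).mean()

def vicreg_covariance(z):
    """Minimize squared off-diagonal covariance entries to
    decorrelate the embedding dimensions."""
    n, d = z.shape
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diagonal(cov))
    return (off_diag ** 2).sum() / d
```

In VICReg the three terms are combined as a weighted sum over both views, e.g. `lam * vicreg_invariance(z_a, z_b) + mu * (vicreg_variance(z_a) + vicreg_variance(z_b)) + nu * (vicreg_covariance(z_a) + vicreg_covariance(z_b))`, where `lam`, `mu`, and `nu` are hyperparameters.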

The VICReg paper has shown that VICReg is more robust to differences in network architecture between the two branches compared to other self-supervised frameworks (Barlow Twins and SimCLR). This robustness could enable multi-modal applications in the future.