A fundamental difference between VICReg and Barlow Twins is the way the branches are regularized. In VICReg, the two branches are regularized independently, as the covariance term is applied to each branch separately, which works better in scenarios where the branches differ substantially, for example when they have different architectures and process different types of data. Indeed, the output statistics of the two branches can be very different, and the amount of regularization each requires may vary accordingly. In Barlow Twins, the regularization is applied to the cross-correlation matrix between the branches, which favors scenarios where the branches produce outputs with similar statistics. We demonstrate this capability of VICReg in a multi-modal experiment where we pretrain on pairs of images and corresponding captions from the MS-COCO dataset. We regularize each branch with a different coefficient, which is not possible with Barlow Twins, and we show that VICReg outperforms Barlow Twins on image and text retrieval downstream tasks. Table 3 reports the performance of VICReg against the contrastive loss of VSE++ (Faghri et al., 2018) and against Barlow Twins, in the identical setting proposed by Faghri et al. (2018). VICReg outperforms both by a significant margin.
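To make the distinction concrete, the sketch below contrasts the two regularization schemes in PyTorch. This is a minimal illustration rather than the released implementation: the function names are ours, and the coefficient defaults follow the values commonly reported for each method. The key structural point is that the VICReg regularizer only ever sees one branch's embeddings, while the Barlow Twins loss is defined on the cross-correlation matrix between the two branches.

```python
import torch
import torch.nn.functional as F

def vicreg_regularization(z, gamma=1.0, eps=1e-4):
    # Variance and covariance terms computed on a SINGLE branch's
    # embeddings z of shape (batch_size, dim). Each branch is
    # regularized independently of the other.
    n, d = z.shape
    z = z - z.mean(dim=0)
    # Variance term: hinge on the per-dimension standard deviation.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(gamma - std).mean()
    # Covariance term: penalize off-diagonal covariance entries.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss

def vicreg_loss(z_a, z_b, sim_coeff=25.0, var_coeff=25.0, cov_coeff=1.0):
    # Only the invariance term ties the two branches together; the
    # variance/covariance terms are per-branch, so in a multi-modal
    # setup each branch could receive its own coefficients.
    sim_loss = F.mse_loss(z_a, z_b)
    var_a, cov_a = vicreg_regularization(z_a)
    var_b, cov_b = vicreg_regularization(z_b)
    return (sim_coeff * sim_loss
            + var_coeff * (var_a + var_b)
            + cov_coeff * (cov_a + cov_b))

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    # Barlow Twins regularizes the cross-correlation matrix BETWEEN
    # the branches, implicitly assuming their output statistics are
    # comparable after per-dimension standardization.
    n = z_a.shape[0]
    z_a = (z_a - z_a.mean(dim=0)) / z_a.std(dim=0)
    z_b = (z_b - z_b.mean(dim=0)) / z_b.std(dim=0)
    c = (z_a.T @ z_b) / n
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + lambd * off_diag
```

Because `vicreg_regularization` is computed per branch, a multi-modal setup can pass different variance and covariance coefficients for the image and text branches. The Barlow Twins objective offers no such per-branch knob, since both branches enter the loss only through the single matrix `c`.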