Whitening for Self-Supervised Representation Learning
Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe
TL;DR
This paper introduces Whitening MSE (W-MSE), a self-supervised learning loss that eliminates the need for negative samples: it whitens the batch embeddings, which constrains them to a spherical distribution, and minimizes the distance between positives. Because a Cholesky-based whitening is applied to the final representations, W-MSE avoids collapsed solutions without momentum encoders or stop-gradient schemes, and it can use multiple positives per image. Empirical results on CIFAR, STL-10, Tiny ImageNet, and ImageNet-100 show that W-MSE, especially with four positives, is competitive with or superior to state-of-the-art SSL methods such as BYOL and SimSiam, while keeping a simpler architecture. The findings highlight whitening as a viable mechanism for stabilizing SSL training and reducing dependence on large numbers of negatives, with potential for combination with asymmetric approaches in future work.
Abstract
Most of the current self-supervised representation learning (SSL) methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance ("positives") are contrasted with instances extracted from other images ("negatives"). For the learning to be effective, many negatives should be compared with a positive pair, which is computationally demanding. In this paper, we propose a different direction and a new loss function for SSL, which is based on the whitening of the latent-space features. The whitening operation has a "scattering" effect on the batch samples, avoiding degenerate solutions where all the sample representations collapse to a single point. Our solution does not require asymmetric networks and it is conceptually simple. Moreover, since negatives are not needed, we can extract multiple positive pairs from the same image instance. The source code of the method and of all the experiments is available at: https://github.com/htdt/self-supervised.
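The idea of whitening the batch and then pulling positives together can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that), and it rests on several assumptions: the function names `whiten` and `w_mse_loss`, the `eps` regularizer, and the view-major row layout are hypothetical choices made for illustration; the whole batch is whitened in a single pass; and the whitened vectors are scaled to unit length before the squared distance is taken.

```python
import numpy as np


def whiten(v, eps=1e-6):
    """Whiten embeddings v of shape [n, d] so their covariance is ~identity.

    Uses a Cholesky factor of the (regularized) sample covariance.
    `eps` is an assumed stabilizer, not a value taken from the paper.
    """
    centered = v - v.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (v.shape[0] - 1)
    cov = cov + eps * np.eye(v.shape[1])
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol.T, chol lower-triangular
    # z = chol^{-1} (v - mean), computed by solving chol @ z.T = centered.T
    return np.linalg.solve(chol, centered.T).T


def w_mse_loss(v, n_images, n_positives, eps=1e-6):
    """Mean squared distance between whitened positives of the same image.

    Assumed layout: v has shape [n_positives * n_images, d], with all
    images for view 0 first, then all images for view 1, and so on.
    """
    z = whiten(v, eps)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-length rows
    z = z.reshape(n_positives, n_images, -1)
    total, n_pairs = 0.0, 0
    for a in range(n_positives):
        for b in range(a + 1, n_positives):
            total += np.sum((z[a] - z[b]) ** 2, axis=1).mean()
            n_pairs += 1
    return total / n_pairs
```

Because the whitened batch is constrained to have identity covariance, its rows cannot all coincide at one point, so the loss cannot be driven to zero by a collapsed representation; no negative pairs enter the computation. In a training setup this would be written with a differentiable framework so that gradients flow through the whitening step.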
