Whitening for Self-Supervised Representation Learning

Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe

TL;DR

This paper introduces a novel self-supervised learning loss called Whitening MSE (W-MSE) that eliminates the need for negative samples by whitening batch embeddings to enforce a spherical distribution and minimizing inter-positive distances. By applying a Cholesky-based whitening to the final representations and evaluating multiple positives per image, W-MSE avoids collapsed solutions without requiring momentum encoders or stop-gradient schemes. Empirical results across CIFAR, STL-10, Tiny ImageNet, and ImageNet-100 show W-MSE, especially with four positives, is competitive with or superior to state-of-the-art SSL methods like BYOL and SimSiam, while maintaining a simpler architecture. The findings highlight whitening as a viable mechanism to stabilize SSL training and reduce dependence on large negative batches, with potential for combination with asymmetric approaches in future work.

Abstract

Most of the current self-supervised representation learning (SSL) methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance ("positives") are contrasted with instances extracted from other images ("negatives"). For the learning to be effective, many negatives should be compared with a positive pair, which is computationally demanding. In this paper, we propose a different direction and a new loss function for SSL, which is based on the whitening of the latent-space features. The whitening operation has a "scattering" effect on the batch samples, avoiding degenerate solutions where all the sample representations collapse to a single point. Our solution does not require asymmetric networks and it is conceptually simple. Moreover, since negatives are not needed, we can extract multiple positive pairs from the same image instance. The source code of the method and of all the experiments is available at: https://github.com/htdt/self-supervised.
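
The mechanism described above (whiten the batch embeddings, then pull positives together with an MSE term) can be illustrated with a small NumPy sketch. This is not the authors' implementation, which lives in the linked repository. The function names `whiten` and `w_mse_loss`, the `eps` regularizer, the joint whitening of all views in one batch, and the exact averaging of the loss are assumptions made for illustration only.

```python
import itertools
import numpy as np


def whiten(v, eps=1e-6):
    """Whiten rows of v (shape: n x k) so their sample covariance is ~identity."""
    centered = v - v.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (v.shape[0] - 1)
    # eps * I is an assumed regularizer that keeps the factorization stable.
    chol = np.linalg.cholesky(cov + eps * np.eye(v.shape[1]))  # cov = chol @ chol.T
    # The whitening matrix is inv(chol); solve instead of inverting explicitly.
    return np.linalg.solve(chol, centered.T).T


def w_mse_loss(views, eps=1e-6):
    """views has shape (d, n, k): d augmented views of each of n images."""
    d, n, k = views.shape
    # Assumption: all d * n embeddings are whitened jointly (no batch slicing).
    z = whiten(views.reshape(d * n, k), eps).reshape(d, n, k)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)  # optional L2 normalization
    pairs = list(itertools.combinations(range(d), 2))  # d * (d - 1) / 2 pairs
    # Squared distance between the two views of each image, averaged per pair.
    per_pair = [np.sum((z[i] - z[j]) ** 2, axis=-1).mean() for i, j in pairs]
    return float(np.mean(per_pair))


rng = np.random.default_rng(0)
views = rng.normal(size=(4, 64, 8))  # d = 4 positives, 64 images, 8-dim embeddings
print(w_mse_loss(views))
```

With $d = 4$ positives the loss averages over $4 \cdot 3 / 2 = 6$ pairs. Because whitening fixes the batch covariance to (approximately) the identity, the embeddings cannot all collapse to a single point, which is why no negatives appear in the loss.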

Paper Structure

This paper contains 13 sections, 13 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A schematic representation of the W-MSE based optimization process. Positive pairs are indicated with the same shapes and colors. (1) A representation of the batch features in $V$ when training starts. (2, 3) The distribution of the elements after whitening and the $L_2$ normalization. (4) The MSE computed over the normalized $\mathbf{z}$ features encourages the network to move the positive pair representations closer to each other. (5) Subsequent iterations move the positive pairs closer and closer together, while the relative layout of the other samples is forced to follow a spherical distribution.
  • Figure 2: A scheme of our training procedure. First, $d$ ($d = 4$ in this case) positive samples are generated using augmentations. These images are transformed into vectors with the encoder $E(\cdot)$. Next, they are projected onto a lower dimensional space with a projection head $g(\cdot)$. Then, whitening projects these vectors onto a spherical distribution, optionally followed by an $L_2$ normalization. Finally, the dashed curves show all the $d(d-1)/2$ comparisons used in our W-MSE loss.
  • Figure 3: Batch slicing. $V$ is first partitioned into $d$ parts ($d=2$ in this example). We randomly permute the first part and apply the same permutation to the other $d-1$ parts. We then split each partition further to create sub-batches ($V_i$). Each $V_i$ is used independently to compute its own whitening matrix $W_V^i$ and centroid $\boldsymbol{\mu}_V^i$.
  • Figure 4: Training dynamics on the STL-10 dataset (linear-classifier based evaluation).
  • Figure 5: Training dynamics on the STL-10 dataset (5-nn classifier based evaluation).
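
The batch slicing described in the Figure 3 caption can be sketched as index bookkeeping. The function below is a hypothetical illustration rather than the authors' code: the name `slice_batch`, the `sub_size` parameter, and the assumed batch layout (the $d$ parts stored one after another, each with `n` rows) are not taken from the paper. It also assumes that each sub-batch draws from a single part, which is only one reading of the caption.

```python
import numpy as np


def slice_batch(n, d, sub_size, rng):
    """Return, for each chunk of images, d aligned index arrays (one per part).

    Assumed layout: row p * n + i holds the p-th view of image i.
    """
    if n % sub_size != 0:
        raise ValueError("n must be divisible by sub_size")
    perm = rng.permutation(n)            # one permutation, shared by all d parts
    chunks = perm.reshape(-1, sub_size)  # split the permuted part into sub-batches
    # Offsetting by part * n applies the same permutation to every other part.
    return [[chunk + part * n for part in range(d)] for chunk in chunks]


rng = np.random.default_rng(0)
for parts in slice_batch(n=8, d=2, sub_size=4, rng=rng):
    # Each index array selects one sub-batch, from which a separate
    # whitening matrix and centroid would be computed.
    print([idx.tolist() for idx in parts])
```

Because every part is permuted identically, position `j` in each of the `d` index arrays of a chunk refers to the same source image, so positives stay aligned after slicing.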