The Common Stability Mechanism behind most Self-Supervised Learning Approaches

Abhishek Jha, Matthew B. Blaschko, Yuki M. Asano, Tinne Tuytelaars

TL;DR

This work addresses embedding collapse in self-supervised learning (SSL) by proposing a center-vector framework. The framework defines a global center $s = \mathbb{E}_{x}[\mathbb{E}_{\omega}[g(\omega(x))]]$, the expected representation over all samples $x$ and augmentations $\omega$, whose magnitude must be kept small to maintain nontrivial representations while a residual, sample-specific component is preserved. By reframing both contrastive and non-contrastive SSL methods as driving $|s|$ toward zero while maintaining the residuals, the authors unify diverse architectures under a common stability principle. They derive gradient interpretations for approaches such as Triplet/InfoNCE, SimSiam, BYOL, DINO, SwAV, and Barlow Twins, and validate the framework with experiments on toy distributions and Imagenet100, including a simple penalized objective $L_{Simple}(f) = 0.5\left(L(f) - \lambda_L s\right)$ that can outperform standard baselines. Key findings show that the center-vector magnitude tracks collapse risk and that mechanisms such as predictors, EMA targets, centering, and fixed prototypes modulate $s$ to sustain non-collapsed representations. The framework offers practical guidance for SSL design and hyperparameter selection, and it explains how learning can remain stable even without negative samples or complex architectural changes.

Abstract

The last couple of years have witnessed tremendous progress in self-supervised learning (SSL), the success of which can be attributed to the introduction of useful inductive biases in the learning process to learn meaningful visual representations while avoiding collapse. These inductive biases and constraints manifest themselves in the form of different optimization formulations in the SSL techniques, e.g. by utilizing negative examples in a contrastive formulation, or an exponential moving average and a predictor in BYOL and SimSiam. In this paper, we provide a framework to explain the stability mechanism of these different SSL techniques: i) we discuss the working mechanism of contrastive techniques like SimCLR and non-contrastive techniques like BYOL, SwAV, SimSiam, Barlow Twins, and DINO; ii) we argue that, despite their different formulations, these methods implicitly optimize a similar objective function, i.e. minimizing the magnitude of the expected representation over all data samples, or the mean of the data distribution, while maximizing the magnitude of the expected representation of individual samples over different data augmentations; iii) we provide mathematical and empirical evidence to support our framework. We formulate different hypotheses and test them using the Imagenet100 dataset.
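To make the two quantities in this objective concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes embeddings are stored as an array of shape (samples, views, dim); the function name `center_and_residuals` and the synthetic "healthy" and "collapsed" embeddings are hypothetical and chosen only for illustration. It estimates the center vector as the mean embedding over all samples and views, and the sample-specific component as each sample's view-averaged embedding minus that center.

```python
import numpy as np


def center_and_residuals(z):
    """Estimate the center vector and per-sample residuals.

    z: array of shape (n_samples, n_views, dim), where z[i, j] stands in
    for the embedding of the j-th augmented view of sample i.
    """
    per_sample_mean = z.mean(axis=1)      # average over views, shape (n, dim)
    s = per_sample_mean.mean(axis=0)      # average over samples, shape (dim,)
    residuals = per_sample_mean - s       # sample-specific component
    return s, residuals


rng = np.random.default_rng(0)
n, v, d = 256, 4, 8

# Hypothetical non-collapsed embeddings: each sample has its own mean,
# and views of the same sample differ only by small noise.
healthy = rng.normal(size=(n, 1, d)) + 0.1 * rng.normal(size=(n, v, d))

# Hypothetical collapsed embeddings: every view of every sample sits
# close to one constant vector.
collapsed = np.ones((1, 1, d)) + 0.01 * rng.normal(size=(n, v, d))

s_healthy, r_healthy = center_and_residuals(healthy)
s_collapsed, r_collapsed = center_and_residuals(collapsed)

print("healthy:   |s| =", np.linalg.norm(s_healthy),
      " mean |residual| =", np.linalg.norm(r_healthy, axis=1).mean())
print("collapsed: |s| =", np.linalg.norm(s_collapsed),
      " mean |residual| =", np.linalg.norm(r_collapsed, axis=1).mean())
```

In this toy setting the collapsed embeddings have a large center-vector norm and almost no residual, while the non-collapsed ones show the opposite pattern, which is the behaviour the framework associates with collapse and its avoidance.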

Paper Structure (28 sections, 25 equations, 13 figures)

Figures (13)

  • Figure 1: Overview of our proposed learning hypothesis. Red and blue points represent different views of two images in feature space. (a) By applying distance minimization loss between two views of the same image, the magnitude of the expected representation over the data ($\mathbb{E}_x[z]$) increases, reducing the variance of the data distribution ($\sigma^2$) in the feature space and thereby reducing their separability. (b) In order to learn a discriminative feature representation, a negative force ($-s$) equal to the expected representation over the data distribution is required. We hypothesize that this negative term is the collapse avoidance mechanism underlying different SSL methods.
  • Figure 2: All methods covered by our proposed framework. For details, please zoom in.
  • Figure 3: Simplified SSL objective: We show that a simplified objective that minimizes the invariance loss with a center vector penalty (green) can outperform SimSiam. We plot the toy dataset distribution on the left and performance curves on the right for the Blob and Moons datasets. Plots are averages of five runs with varying seeds, and variance is shown by shaded regions.
  • Figure 4: Evaluation on toy datasets: Standard SimSiam, SimSiam without predictor, and SimSiam without stop-gradient are shown in blue, red, and pink, respectively. Plots are averages of five runs with varying seeds, and variance is shown by shaded regions. The center vector is high in both cases of collapse, i.e. SimSiam without predictor and SimSiam without stop-gradient. This empirically verifies the role of the predictor and stop-gradient in collapse avoidance in SimSiam, based on our formulation. The input dataset distribution can be viewed in Figure 3.
  • Figure 5: Barlow Twins can learn non-collapsed features without the decorrelation term in the loss formulation: (a) shows the kNN accuracy of Barlow Twins with and without the decorrelation term in the loss on Imagenet100; (b) and (c) show the norm of the center vector of $z$ before and after BN while training Barlow Twins in the two aforementioned settings, respectively. We can see that BN helps remove the center vector component from $z$.
  • ...and 8 more figures
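
The observation in the Figure 5 caption, that BN removes the center vector component from $z$, can be illustrated with a small NumPy sketch. This is an assumption-laden toy, not the paper's setup: `batch_norm` is a hypothetical helper implementing training-mode batch normalization without learnable scale or shift, and the embeddings are synthetic with a deliberately large shared offset. Because this form of BN subtracts the per-feature batch mean, the batch estimate of the center vector is driven to approximately zero.

```python
import numpy as np


def batch_norm(z, eps=1e-5):
    """Training-mode batch normalization without learnable scale/shift.

    z: array of shape (batch, dim). Each feature is standardized using
    the statistics of the current batch.
    """
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)

# Synthetic embeddings sharing a large common offset, i.e. a large
# center vector relative to the per-sample variation.
z = 3.0 + rng.normal(size=(512, 16))

center_before = np.linalg.norm(z.mean(axis=0))
center_after = np.linalg.norm(batch_norm(z).mean(axis=0))

print("center norm before BN:", center_before)
print("center norm after BN: ", center_after)
```

The sketch only covers the mean-subtraction effect on a single batch; it says nothing about how BN interacts with the rest of the Barlow Twins loss during training.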