The Common Stability Mechanism behind most Self-Supervised Learning Approaches
Abhishek Jha, Matthew B. Blaschko, Yuki M. Asano, Tinne Tuytelaars
TL;DR
This work addresses embedding collapse in self-supervised learning by proposing a center-vector framework. It defines a global center $s = \mathbb{E}_{x}[\mathbb{E}_{\omega}[g(\omega(x))]]$, the expected representation over all samples $x$ and augmentations $\omega$, whose magnitude must be kept small to maintain nontrivial representations while a residual, sample-specific component is preserved. By reframing both contrastive and non-contrastive SSL methods as driving $|s|$ toward zero while maintaining the residuals, the authors unify diverse architectures under a common stability principle. They derive gradient interpretations for Triplet/InfoNCE, SimSiam, BYOL, DINO, SwAV, and Barlow Twins, and validate the framework with experiments on toy distributions and Imagenet100, including a simple penalized objective $L_{Simple}(f) = 0.5\left(L(f) - \lambda_L s\right)$ that can outperform standard baselines. Key findings are that the center-vector magnitude tracks collapse risk, and that mechanisms such as predictors, EMA targets, centering, and fixed prototypes modulate $s$ to sustain non-collapsed representations. The framework offers practical guidance for SSL design and hyperparameter selection, supporting robust learning even without negative samples or complex architectural changes.
Abstract
The last couple of years have witnessed tremendous progress in self-supervised learning (SSL), the success of which can be attributed to the introduction of useful inductive biases in the learning process to learn meaningful visual representations while avoiding collapse. These inductive biases and constraints manifest themselves in the form of different optimization formulations in the SSL techniques, e.g., by utilizing negative examples in a contrastive formulation, or an exponential moving average and a predictor in BYOL and SimSiam. In this paper, we provide a framework to explain the stability mechanism of these different SSL techniques: i) we discuss the working mechanism of contrastive techniques like SimCLR and non-contrastive techniques like BYOL, SwAV, SimSiam, Barlow Twins, and DINO; ii) we provide an argument that, despite their different formulations, these methods implicitly optimize a similar objective function, i.e., minimizing the magnitude of the expected representation over all data samples, or the mean of the data distribution, while maximizing the magnitude of the expected representation of individual samples over different data augmentations; iii) we provide mathematical and empirical evidence to support our framework. We formulate different hypotheses and test them using the Imagenet100 dataset.
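The two quantities named in the abstract can be illustrated on synthetic data. The following is a minimal NumPy sketch, not the authors' code: it assumes embeddings are already available as an array of shape (samples, augmentations, dimensions), and the function name `center_and_residuals` and the toy data are hypothetical. It estimates the expected representation over all samples and augmentations (the center) and the per-sample expected representation with the center removed (the residual).

```python
import numpy as np

def center_and_residuals(z):
    """Estimate the center vector and per-sample residuals.

    z: array of shape (num_samples, num_augs, dim) holding embeddings
       of each augmented view (a hypothetical layout for illustration).
    """
    per_sample = z.mean(axis=1)        # mean over augmentations, per sample
    center = per_sample.mean(axis=0)   # mean over samples: the center vector
    residuals = per_sample - center    # sample-specific component
    return center, residuals

rng = np.random.default_rng(0)

# Toy embeddings: shared offset + sample-specific part + augmentation noise.
shared = np.array([3.0, 0.0, 0.0, 0.0])
sample_part = rng.normal(size=(256, 1, 4))
aug_noise = 0.1 * rng.normal(size=(256, 8, 4))
z = shared + sample_part + aug_noise

center, residuals = center_and_residuals(z)
center_norm = np.linalg.norm(center)
mean_residual_norm = np.linalg.norm(residuals, axis=1).mean()

# Fully collapsed embeddings: every view maps to the same vector,
# so only the center survives and all residuals vanish.
z_collapsed = np.broadcast_to(shared, (256, 8, 4))
c_center, c_residuals = center_and_residuals(z_collapsed)
collapsed_residual_norm = np.linalg.norm(c_residuals, axis=1).mean()

print(f"center norm: {center_norm:.3f}")
print(f"mean residual norm: {mean_residual_norm:.3f}")
print(f"collapsed mean residual norm: {collapsed_residual_norm:.3f}")
```

In this toy setting, collapse shows up as a representation dominated by the center with no residual left, which is the situation the framework describes the different SSL methods as avoiding.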
