Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning
Achleshwar Luthra, Tianbao Yang, Tomer Galanti
TL;DR
Self-supervised contrastive learning (CL) often yields class-aware structure without labels. The authors prove a duality showing CL implicitly approximates a negatives-only supervised contrastive loss (NSCL), with a label-agnostic bound that tightens as the number of classes grows. They characterize NSCL minimizers as exhibiting augmentation collapse, within-class collapse, and a simplex ETF of class centers, implying strong few-shot linear-probe performance, and introduce a directional CDNV-based bound that links few-shot error to directional variance. Empirically, the CL-NSCL gap contracts with more classes, and a tight bound predicts downstream probing performance across datasets and architectures, providing both theoretical insight and practical guidance for SSL pre-training.
Abstract
Despite its empirical success, the theoretical foundations of self-supervised contrastive learning (CL) are not yet fully established. In this work, we address this gap by showing that standard CL objectives implicitly approximate a supervised variant we call the negatives-only supervised contrastive loss (NSCL), which excludes same-class contrasts. We prove that the gap between the CL and NSCL losses vanishes as the number of semantic classes increases, under a bound that is both label-agnostic and architecture-independent. We characterize the geometric structure of the global minimizers of the NSCL loss: the learned representations exhibit augmentation collapse, within-class collapse, and class centers that form a simplex equiangular tight frame. We further introduce a new bound on the few-shot error of linear-probing. This bound depends on two measures of feature variability--within-class dispersion and variation along the line between class centers. We show that directional variation dominates the bound and that the within-class dispersion's effect diminishes as the number of labeled samples increases. These properties enable CL and NSCL-trained representations to support accurate few-shot label recovery using simple linear probes. Finally, we empirically validate our theoretical findings: the gap between CL and NSCL losses decays at a rate of $\mathcal{O}(\frac{1}{\#\text{classes}})$; the two losses are highly correlated; minimizing the CL loss implicitly brings the NSCL loss close to the value achieved by direct minimization; and the proposed few-shot error bound provides a tight estimate of probing performance in practice. The code and project page of the paper are available at [\href{https://github.com/DLFundamentals/understanding-ssl}{code}, \href{https://dlfundamentals.github.io/ssl-is-approximately-sl/}{project page}].
