Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning

Achleshwar Luthra, Tianbao Yang, Tomer Galanti

TL;DR

Self-supervised contrastive learning (CL) often yields class-aware structure without labels. The authors prove a duality showing CL implicitly approximates a negatives-only supervised contrastive loss (NSCL), with a label-agnostic bound that tightens as the number of classes grows. They characterize NSCL minimizers as exhibiting augmentation collapse, within-class collapse, and a simplex ETF of class centers, implying strong few-shot linear-probe performance, and introduce a directional CDNV-based bound that links few-shot error to directional variance. Empirically, the CL-NSCL gap contracts with more classes, and a tight bound predicts downstream probing performance across datasets and architectures, providing both theoretical insight and practical guidance for SSL pre-training.

Abstract

Despite its empirical success, the theoretical foundations of self-supervised contrastive learning (CL) are not yet fully established. In this work, we address this gap by showing that standard CL objectives implicitly approximate a supervised variant we call the negatives-only supervised contrastive loss (NSCL), which excludes same-class contrasts. We prove that the gap between the CL and NSCL losses vanishes as the number of semantic classes increases, under a bound that is both label-agnostic and architecture-independent. We characterize the geometric structure of the global minimizers of the NSCL loss: the learned representations exhibit augmentation collapse, within-class collapse, and class centers that form a simplex equiangular tight frame. We further introduce a new bound on the few-shot error of linear probing. This bound depends on two measures of feature variability: within-class dispersion and variation along the line between class centers. We show that directional variation dominates the bound and that the within-class dispersion's effect diminishes as the number of labeled samples increases. These properties enable CL and NSCL-trained representations to support accurate few-shot label recovery using simple linear probes. Finally, we empirically validate our theoretical findings: the gap between CL and NSCL losses decays at a rate of $\mathcal{O}(\frac{1}{\#\text{classes}})$; the two losses are highly correlated; minimizing the CL loss implicitly brings the NSCL loss close to the value achieved by direct minimization; and the proposed few-shot error bound provides a tight estimate of probing performance in practice. The code and project page of the paper are available at https://github.com/DLFundamentals/understanding-ssl (code) and https://dlfundamentals.github.io/ssl-is-approximately-sl/ (project page).
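
The simplex equiangular tight frame (ETF) mentioned in the abstract can be made concrete with a small construction. The sketch below is illustrative only and is not taken from the paper's code: the function name `simplex_etf` is hypothetical, and the construction (centering the standard basis and normalizing) is one standard way to obtain $C$ unit vectors whose pairwise cosine similarity is $-\tfrac{1}{C-1}$.

```python
import numpy as np

def simplex_etf(num_classes: int) -> np.ndarray:
    """Return C unit vectors in R^C with pairwise cosine -1/(C-1).

    Hypothetical helper for illustration; not the authors' code.
    """
    C = num_classes
    # Subtract the mean from each standard basis vector ...
    M = np.eye(C) - np.ones((C, C)) / C
    # ... then rescale every row to unit norm.
    return M / np.linalg.norm(M, axis=1, keepdims=True)

C = 5
centers = simplex_etf(C)
gram = centers @ centers.T                  # pairwise cosine similarities
off_diag = gram[~np.eye(C, dtype=bool)]     # entries with i != j

print("unit norms:", np.allclose(np.diag(gram), 1.0))
print("equiangular:", np.allclose(off_diag, -1.0 / (C - 1)))
print("centered:", np.allclose(centers.sum(axis=0), 0.0))
```

All class centers are equally far from one another in this configuration, which is the geometry the abstract attributes to the class centers of NSCL minimizers.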

Paper Structure

This paper contains 17 sections, 8 theorems, 90 equations, 13 figures, 3 tables.

Key Result

Theorem 1

Let $S = \{(x_{i}, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times [C]$ be a labeled dataset with $C$ classes, each containing at most $n_{\max}$ distinct samples. Let $f:\mathcal{X} \to \mathbb{R}^d$ be any function. Then, we have $\mathcal{L}^{\mathrm{DCL}}(f) \le \mathcal{L}^{\mathrm{NSCL}}(f) + \log(1+\tfrac{n_{\max}\mathrm{e}^2}{N-n_{\max}})$, where $\mathrm{e}$ denotes Euler's number. For a balanced classification problem, $\tfrac{n_{\max}}{N-n_{\max}}= \tfrac{1}{C-1}$, so the bound becomes $\log(1+\tfrac{\mathrm{e}^2}{C-1})$.
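
A small numerical check can make the bound $\log(1+\tfrac{n_{\max}\mathrm{e}^2}{N-n_{\max}})$, quoted in the captions of Figures 3 and 4, more tangible. The sketch below rests on assumptions that this page does not spell out: cosine similarity with temperature 1, two augmented views per sample, a DCL denominator that sums over all views of every other sample, and an NSCL denominator that keeps only samples from other classes. The function names `dcl_and_nscl` and `gap_bound` are hypothetical, and the random "embeddings" stand in for a trained encoder.

```python
import numpy as np

def _normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def dcl_and_nscl(z1, z2, labels):
    """Assumed DCL/NSCL losses for two views z1, z2 of shape (N, d)."""
    views = [_normalize(z1), _normalize(z2)]
    N = len(labels)
    other_sample = ~np.eye(N, dtype=bool)              # j != i
    other_class = labels[:, None] != labels[None, :]   # y_j != y_i
    dcl, nscl = [], []
    for a in views:                                    # anchor view
        # exp-similarity of each anchor to every sample, summed over views
        neg = sum(np.exp(a @ v.T) for v in views)      # shape (N, N)
        log_den_dcl = np.log((neg * other_sample).sum(axis=1))
        log_den_nscl = np.log((neg * other_class).sum(axis=1))
        for p in views:                                # positive view
            pos = np.sum(a * p, axis=1)                # sim of the positive pair
            dcl.append(log_den_dcl - pos)
            nscl.append(log_den_nscl - pos)
    return float(np.mean(dcl)), float(np.mean(nscl))

def gap_bound(n_max, N):
    """log(1 + n_max * e^2 / (N - n_max))."""
    return float(np.log1p(n_max * np.e ** 2 / (N - n_max)))

rng = np.random.default_rng(0)
C, per_class, d = 10, 8, 16                # balanced toy problem
labels = np.repeat(np.arange(C), per_class)
N = labels.size
z1 = rng.normal(size=(N, d))               # stand-in for encoder outputs
z2 = z1 + 0.1 * rng.normal(size=(N, d))    # stand-in for a second augmentation

dcl, nscl = dcl_and_nscl(z1, z2, labels)
bound = gap_bound(per_class, N)
print(f"DCL - NSCL = {dcl - nscl:.4f}, bound = {bound:.4f}")
```

Under these assumptions the positive-pair term cancels in the difference, so the gap depends only on how much the same-class terms add to the denominator. In the balanced case `gap_bound(per_class, N)` coincides with $\log(1+\tfrac{\mathrm{e}^2}{C-1})$.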

Figures (13)

  • Figure 1: DCL forms semantic clusters without label supervision, while NSCL yields tighter, more separable clusters, despite not explicitly pulling same-class samples together. We plot UMAP visualizations for (top) decoupled contrastive learning (DCL; doi:10.1007/978-3-031-19809-0_38) and (bottom) negatives-only supervised contrastive learning (NSCL) training on mini-ImageNet. See the experiments appendix for details.
  • Figure 2: Illustration of CDNV and directional CDNV. CDNV compares how tightly samples within a class cluster to how far apart class centers are; lower values indicate tighter clusters and larger gaps between classes. Directional CDNV measures variability only along the line connecting two class centers, highlighting the component most relevant for distinguishing those classes.
  • Figure 3: (Top) We train the model to minimize the DCL loss, tracking the DCL loss, the NSCL loss, and the bound NSCL+$\log(1+\tfrac{n_{\max}\mathrm{e}^2}{N-n_{\max}})$ on both the training and test sets throughout training. All three quantities are highly correlated. (Bottom) We compare the NSCL loss of two models: one trained with the DCL loss and the other with the NSCL loss. The resulting NSCL losses are comparable, regardless of the training objective. In both the top and bottom plots, correlations are computed between the DCL and NSCL losses on the train and test data.
  • Figure 4: The gap between the DCL and NSCL losses shrinks as the number of classes $C$ grows. Models were trained to minimize the DCL loss, and at several training epochs we plot the empirical difference $\mathcal{L}^{\mathrm{DCL}} - \mathcal{L}^{\mathrm{NSCL}}$ alongside the bound $\log(1 + \tfrac{\mathrm{e}^2}{C-1})$ as a function of $C$. We also report the correlation between the loss gap at epoch $300$ and the bound.
  • Figure 5: The bound in the error-bound corollary is fairly tight for ImageNet pre-trained models. We reproduced the few-shot figure with a ResNet-50 pretrained on ImageNet-1K (IM-1K).
  • ...and 8 more figures
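
The two quantities described in the Figure 2 caption can be sketched in a few lines. The definitions used here are assumptions inferred from that caption, and the paper's exact normalization may differ: CDNV is taken as the mean squared distance of class-$i$ features to their mean divided by the squared distance between the two class means, and directional CDNV replaces the numerator with the variance of the features projected onto the unit vector between the class means. The function names are hypothetical.

```python
import numpy as np

def cdnv(feats_i, feats_j):
    """Assumed CDNV: within-class variance of class i over squared center distance."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    var_i = np.mean(np.sum((feats_i - mu_i) ** 2, axis=1))
    return float(var_i / np.sum((mu_i - mu_j) ** 2))

def directional_cdnv(feats_i, feats_j):
    """Assumed directional CDNV: variance along the line between class centers."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    diff = mu_i - mu_j
    dist_sq = np.sum(diff ** 2)
    u = diff / np.sqrt(dist_sq)          # unit vector between the centers
    proj = (feats_i - mu_i) @ u          # signed deviation along that direction
    return float(np.mean(proj ** 2) / dist_sq)

rng = np.random.default_rng(0)
class_a = rng.normal(size=(200, 32))         # toy features for one class
class_b = rng.normal(size=(200, 32)) + 3.0   # second class, shifted center

full = cdnv(class_a, class_b)
directional = directional_cdnv(class_a, class_b)
print(f"CDNV = {full:.4f}, directional CDNV = {directional:.4f}")
```

Because a projection onto a unit vector can never exceed the norm of the vector being projected, the directional value is at most the full CDNV under these definitions. For roughly isotropic features in high dimension it is much smaller, which fits the caption's point that only the component along the line between centers matters for separating the two classes.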

Theorems & Definitions (13)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Proposition 1
  • proof
  • Corollary 1
  • ...and 3 more