On the Alignment Between Supervised and Self-Supervised Contrastive Learning

Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti

TL;DR

The paper addresses why contrastive learning (CL) yields representations aligned with semantic classes by establishing representation-space coupling between CL and Negatives-Only Supervised Contrastive Learning (NSCL) under shared randomness. It develops a similarity-space analysis showing that CL and NSCL embeddings remain structurally aligned throughout training, deriving high-probability lower bounds on CKA and RSA that improve with more classes and higher temperature τ. Despite potentially divergent weight trajectories, the representation geometry remains tightly coupled, positioning NSCL as a principled bridge between self-supervised and supervised learning. Empirical validation across multiple datasets confirms the predicted trends and highlights stronger CL–NSCL alignment compared to other supervised objectives, suggesting practical utility in transferring insights between SSL and supervised paradigms.
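To make the CL/NSCL distinction concrete, here is a minimal NumPy sketch of the two losses on a single batch. It rests on assumptions not stated in this summary: a SimCLR-style loss over two augmented views with cosine similarity, and NSCL taken to mean the same loss with same-class samples removed from the denominator (the positive pair is kept). The function name `contrastive_losses` and its arguments are hypothetical, and the paper's exact formulation may differ, for example in using a decoupled variant.

```python
import numpy as np

def contrastive_losses(z1, z2, labels, tau=0.5):
    """Return (cl_loss, nscl_loss) for one batch of two augmented views.

    z1, z2 : (B, d) embeddings of the two views of the same B samples.
    labels : (B,) integer class labels (used only by the NSCL variant).
    Assumed forms: SimCLR-style CL; NSCL drops same-class negatives.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    y = np.concatenate([labels, labels])
    n = z.shape[0]
    half = n // 2

    sim = (z @ z.T) / tau
    rows = np.arange(n)
    pos_idx = (rows + half) % n            # index of the other view
    pos = sim[rows, pos_idx]

    not_self = ~np.eye(n, dtype=bool)      # CL: everything except the anchor
    is_pos = np.zeros((n, n), dtype=bool)
    is_pos[rows, pos_idx] = True
    diff_class = y[:, None] != y[None, :]  # NSCL: only other-class negatives

    def loss(mask):
        # Numerically stable log-sum-exp over the allowed denominator terms.
        masked = np.where(mask, sim, -np.inf)
        m = masked.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(masked - m).sum(axis=1))
        return float(np.mean(lse - pos))

    return loss(not_self), loss(diff_class | is_pos)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
cl, nscl = contrastive_losses(z1, z2, np.array([0, 0, 1, 1, 2, 2, 3, 3]))
cl_u, nscl_u = contrastive_losses(z1, z2, np.arange(8))
```

Under these assumptions the NSCL denominator is a subset of the CL denominator, so the NSCL loss never exceeds the CL loss, and the two coincide when every sample in the batch has a distinct label.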

Abstract

Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, the Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives? We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes and higher temperatures, and how it depends on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at https://github.com/DLFundamentals/understanding_ssl_v2 (code) and https://dlfundamentals.github.io/cl-nscl-representation-alignment/ (project page).
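The two alignment metrics named in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: CKA is taken in its linear form, and RSA is assumed to be the Pearson correlation between the off-diagonal entries of the two cosine-similarity matrices (other RSA variants use rank correlation or dissimilarities). The function names `linear_cka` and `rsa` are hypothetical.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

def rsa(X, Y):
    """Assumed RSA variant: Pearson correlation of the upper-triangular
    entries of the two cosine-similarity matrices."""
    def cosine_sim(Z):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        return Z @ Z.T
    iu = np.triu_indices(X.shape[0], k=1)   # skip the diagonal
    return float(np.corrcoef(cosine_sim(X)[iu], cosine_sim(Y)[iu])[0, 1])

# Toy check: a rotated copy of X is perfectly aligned with X, while an
# independent random matrix is not.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
Y_rot = X @ Q
Y_rand = rng.normal(size=(200, 16))

cka_rot, rsa_rot = linear_cka(X, Y_rot), rsa(X, Y_rot)
cka_rand, rsa_rand = linear_cka(X, Y_rand), rsa(X, Y_rand)
```

Both metrics depend only on sample-by-sample similarity structure, so they are unchanged by a rotation of the embedding space. This is why two models can score high on CKA and RSA even when their weights have drifted apart.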

Paper Structure

This paper contains 18 sections, 17 theorems, 114 equations, 7 figures, 1 table.

Key Result

Theorem 1

Fix $B,T\in\mathbb N$, $\delta\in(0,1)$, and temperature $\tau>0$. Consider the coupled similarity-descent recursions (eq:Sigma-descent) for CL and NSCL with shared initialization and shared mini-batches/augmentations. Then, with probability at least $1-\delta$ over the draws of the mini-batc

Figures (7)

  • Figure 1: Comparison of learning dynamics for CL and NSCL models. (a) Weight space vectors show divergent paths ($85.7^\circ$ apart). (b) In contrast, representation space vectors for a target class show high alignment ($27.8^\circ$ apart). (c) This is confirmed over training epochs, where representational similarity (CKA, RSA) remains high while the weight gap increases (see the experiments appendix for figure details).
  • Figure 2: Alignment during training. We train ResNet-50 models with decoupled CL, SCL, NSCL, and CE. For the first 1,000 epochs, the CL-trained model is substantially more aligned with the NSCL-trained model than with the others. However, alignment declines when training continues much longer.
  • Figure 3: CL–NSCL alignment (linear CKA) increases with the number of training classes. The heatmaps show the linear CKA between CL and NSCL models. For each dataset, we visualize alignment on the training (top row, green) and test (bottom row, purple) sets. The y-axis indicates the number of classes ($N$) used for training, and the x-axis represents the training epoch. While alignment is consistently higher for larger $N$, it also tends to decrease as training progresses for any fixed $N$.
  • Figure 4: Higher $\tau$ increases the CL–NSCL alignment. The plots show RSA (top row) and CKA (bottom row) over 300 epochs. We trained CL and NSCL models with varying temperatures ($\tau \in \{ 0.1, 0.5, 1.0\}$) on four datasets. Across all datasets, the highest temperature, $\tau = 1.0$ (shown in purple), yields the highest alignment.
  • Figure 5: Effect of batch size with scaled learning rates. We trained CL and NSCL models for 300 epochs on Mini-ImageNet with varying batch sizes ($B \in \{256,\, 512,\, 1024\}$). For each experiment, the learning rate $\eta$ is scaled as a function of batch size, as indicated under each panel. For instance, the results shown in panel (b) use a learning rate of $\eta = \frac{0.3\,\sqrt{B}}{256}$.
  • ...and 2 more figures
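As a quick worked example of the square-root rule quoted in the Figure 5 caption, the sketch below evaluates $\eta = 0.3\sqrt{B}/256$ for the three batch sizes listed there. The helper name `scaled_lr` and its keyword arguments are hypothetical. Only the panel (b) rule appears in this summary, so the scaling used in the other panels is not reproduced.

```python
import math

def scaled_lr(batch_size, base=0.3, ref=256):
    """Square-root learning-rate scaling: eta = base * sqrt(B) / ref."""
    return base * math.sqrt(batch_size) / ref

# Learning rates for the batch sizes in the Figure 5 caption.
lrs = {B: scaled_lr(B) for B in (256, 512, 1024)}
for B, eta in lrs.items():
    print(f"B={B:5d}  eta={eta:.5f}")
```

Under this rule, quadrupling the batch size from 256 to 1024 doubles the learning rate, from 0.01875 to 0.0375.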

Theorems & Definitions (30)

  • Theorem 1: Similarity-space coupling
  • Corollary 1: CKA lower bound
  • Corollary 2: RSA lower bound
  • Theorem 2
  • Lemma 1: Anchor-block orthogonality
  • proof
  • Lemma 2: Softmax Hessian and gradient Lipschitzness
  • proof
  • Lemma 3: Per-anchor gradient norm and batch average
  • proof
  • ...and 20 more