Hard-Negative Sampling for Contrastive Learning: Optimal Representation Geometry and Neural- vs Dimensional-Collapse

Ruijie Jiang, Thuan Nguyen, Shuchin Aeron, Prakash Ishwar

TL;DR

This work analyzes hard-negative sampling in contrastive learning under a general latent data model. Using the Harris inequality, it proves that for any representation mapping the HSCL and HUCL losses are lower bounded by the SCL and UCL losses, respectively. It shows that the globally optimal geometry for SCL and HSCL is Neural-Collapse (NC), where class means form a normalized Equiangular Tight Frame (ETF) and within-class variance vanishes, without requiring class-conditional independence of augmented views; a parallel NC-optimality result is established for UCL when the embedding dimension exceeds the latent-class count. The authors extend these results to empirical and batched losses and show empirically that Adam optimization with unit-ball or unit-sphere normalization can converge to the NC geometry at suitable hardness levels. In real-data experiments (CIFAR-10/100, Tiny ImageNet), hard negatives mitigate Dimensional-Collapse (DC). Without hard negatives or feature normalization, representations tend toward DC, which underscores the role of both. The work closes with open questions about tight lower bounds for HUCL, latent-class collisions, and the training dynamics underpinning NC attainment, and provides publicly available code for replication.

Abstract

For a widely-studied data model and general loss and sample-hardening functions we prove that the losses of Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) are minimized by representations that exhibit Neural-Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) losses are lower bounded by the corresponding SCL and UCL losses. In contrast to existing literature, our theoretical results for SCL do not require class-conditional independence of augmented views and work for a general loss function class that includes the widely used InfoNCE loss function. Moreover, our proofs are simpler, compact, and transparent. Similar to existing literature, our theoretical claims also hold for the practical scenario where batching is used for optimization. We empirically demonstrate, for the first time, that Adam optimization (with batching) of HSCL and HUCL losses with random initialization and suitable hardness levels can indeed converge to the NC-geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard-negatives or feature normalization, however, the representations learned via Adam suffer from Dimensional-Collapse (DC) and fail to attain the NC-geometry. These results exemplify the role of hard-negative sampling in contrastive representation learning and we conclude with several open theoretical problems for future work. The code can be found at https://github.com/rjiang03/HCL/tree/main
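
To make the NC geometry concrete, the following is a minimal sketch that builds one standard realization of an ETF (the simplex ETF) and checks three of its properties: the class means sum to zero, have unit norm, and share a common pairwise inner product of $-1/(C-1)$. The function names `simplex_etf` and `etf_diagnostics` and the exact form of the checks are illustrative assumptions, not the authors' code or their metric definitions.

```python
import numpy as np

def simplex_etf(num_classes: int) -> np.ndarray:
    """Return a (C, C) array whose rows are unit-norm simplex-ETF vectors."""
    C = num_classes
    # Center the standard basis so the rows sum to zero, then rescale to unit norm.
    centered = np.eye(C) - np.ones((C, C)) / C
    return centered * np.sqrt(C / (C - 1))

def etf_diagnostics(means: np.ndarray) -> dict:
    """Deviation from three ETF properties for class means stored as rows.

    These definitions are assumptions made for illustration only.
    """
    C = means.shape[0]
    gram = means @ means.T
    off_diag = gram[~np.eye(C, dtype=bool)]
    return {
        # Norm of the sum of all class means (0 for an ETF).
        "zero_sum": float(np.linalg.norm(means.sum(axis=0))),
        # Largest deviation of a squared norm from 1.
        "unit_norm": float(np.max(np.abs(np.diag(gram) - 1.0))),
        # Largest deviation of a pairwise inner product from -1/(C-1).
        "equal_inner_product": float(np.max(np.abs(off_diag + 1.0 / (C - 1)))),
    }

means = simplex_etf(10)
diagnostics = etf_diagnostics(means)
print(diagnostics)
```

All three deviations should be zero up to floating-point error. The vectors here live in $\mathbb{R}^C$ for simplicity, although they span only a $(C-1)$-dimensional subspace.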

Paper Structure

This paper contains 32 sections, 6 theorems, 32 equations, 15 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $\psi_k$ in (eq:genCLlossfun) be argument-wise non-decreasing over ${\mathbb R}^k$ and assume that all expectations associated with $L^{(k)}_{UCL}(f)$, $L^{(k)}_{HUCL}(f)$, $L^{(k)}_{SCL}(f)$, $L^{(k)}_{HSCL}(f)$ exist and are finite. Then, for all $f$ and all $k$, $L^{(k)}_{HUCL}(f) \geq L^{(k)}_{UCL}(f)$ and $L^{(k)}_{HSCL}(f) \geq L^{(k)}_{SCL}(f)$.
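
The claim that a hard-negative loss is never smaller than its non-hardened counterpart can be checked numerically in a toy setting. The sketch below assumes a single negative ($k = 1$), a finite pool of candidate negatives, an InfoNCE-style loss $\psi(t) = \log(1 + e^{t})$, and an exponential hardening function $\eta(t) = e^{\beta t}$. All of these choices, the synthetic data, and the function names are illustrative assumptions rather than the paper's experimental setup. Because both $\psi$ and $\eta$ are non-decreasing in the same argument, re-weighting the negatives by $\eta$ cannot lower the expected loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(x):
    """Scale each row to unit Euclidean norm (unit-sphere normalization)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Hypothetical toy data: one anchor, one positive, a finite pool of negatives.
dim, pool_size = 8, 500
anchor = unit_rows(rng.normal(size=(1, dim)))[0]
positive = unit_rows((anchor + 0.1 * rng.normal(size=dim))[None, :])[0]
negatives = unit_rows(rng.normal(size=(pool_size, dim)))

# Loss argument: negative similarity minus positive similarity.
t = negatives @ anchor - positive @ anchor

def psi(t):
    """InfoNCE-style loss with one negative: log(1 + exp(t)), non-decreasing in t."""
    return np.logaddexp(0.0, t)

def hardened_loss(t, beta):
    """Expected loss when negatives are re-weighted by eta(t) = exp(beta * t)."""
    weights = np.exp(beta * (t - t.max()))  # shift by the max for numerical stability
    weights /= weights.sum()
    return float(weights @ psi(t))

plain_loss = float(psi(t).mean())  # uniform sampling from the pool
hard_losses = {beta: hardened_loss(t, beta) for beta in (0.0, 1.0, 5.0, 20.0)}
print(plain_loss, hard_losses)
```

With $\beta = 0$ the hardened loss coincides with the plain loss, and every positive $\beta$ gives a value at least as large.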

Figures (15)

  • Figure 1: Graphical model for augmentation and negative sampling for SCL and UCL settings used in practical implementations such as SimCLR.
  • Figure 2: Synthetic dataset results using label information (top row figures) or additive Gaussian noise augmentation mechanism (bottom row figures) for generating anchor-positive pairs: Initial two-dimensional representations (left), post-training SCL and HSCL representations and losses at different hardness levels (middle), post-training UCL and HUCL representations and losses at different hardness levels (right).
  • Figure 3: Results for CIFAR100 under supervised settings (SCL, HSCL, left column) and unsupervised settings (UCL, HUCL, right column) with unit-ball normalization and random initialization. From top to bottom: Downstream Test Accuracy, Zero-sum metric, Unit-norm metric, and Equal inner-product metric, all plotted against the number of epochs.
  • Figure 4: Normalized singular values of the empirical covariance matrix of class means (in representation space) plotted in log-scale in decreasing order for CIFAR100 under supervised (left column) and unsupervised (right column) settings. The horizontal axis is the sorted index of the singular values. From top to bottom: Unit-ball normalization with random initialization, Unit-ball normalization with near-NC initialization, Unit-sphere normalization with random initialization, and un-normalized representation with random initialization.
  • Figure 5: The values of the UCL loss lower-bound in Theorem \ref{thm:genUCLlosslb} for the InfoNCE loss function, for $k = 1, 2, 3, 4, 5$ negative samples and $C = 2, 3, \ldots, 20$ latent classes.
  • ...and 10 more figures
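
The kind of spectrum described for Figure 4 can be sketched as follows: compute the class means, form their empirical covariance matrix, and normalize its singular values by the largest one. A fast drop toward zero after only a few indices suggests that the class means occupy a low-dimensional subspace, which is the signature of Dimensional-Collapse. The function name `class_mean_spectrum`, the normalization by the top singular value, and the synthetic features are assumptions made for illustration; the authors' code may differ.

```python
import numpy as np

def class_mean_spectrum(features, labels):
    """Normalized singular values (descending) of the covariance of class means."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = means - means.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(classes)
    singular_values = np.linalg.svd(cov, compute_uv=False)  # sorted descending
    return singular_values / singular_values[0]

rng = np.random.default_rng(0)
num_classes, per_class, dim = 10, 50, 16
labels = np.repeat(np.arange(num_classes), per_class)

# Hypothetical "spread" features: class means point in many directions.
spread_means = rng.normal(size=(num_classes, dim))
spread = spread_means[labels] + 0.01 * rng.normal(size=(len(labels), dim))

# Hypothetical "collapsed" features: class means confined to a 2-D subspace.
basis = rng.normal(size=(2, dim))
collapsed_means = rng.normal(size=(num_classes, 2)) @ basis
collapsed = collapsed_means[labels] + 0.01 * rng.normal(size=(len(labels), dim))

spread_spectrum = class_mean_spectrum(spread, labels)
collapsed_spectrum = class_mean_spectrum(collapsed, labels)

# Log-scale values, as one would plot them against the sorted index.
print(np.log10(spread_spectrum + 1e-12))
print(np.log10(collapsed_spectrum + 1e-12))
```

In this toy example the collapsed spectrum falls by several orders of magnitude after its second entry, while the spread spectrum decays gradually. Note that with $C$ classes the centered class means can span at most $C - 1$ directions, so entries beyond that index are numerically zero in both cases.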

Theorems & Definitions (16)

  • Definition 1: Generalized Contrastive Loss
  • Definition 2: Hardening function
  • Definition 3: $\eta$-harder negatives for SCL
  • Definition 4: $\eta$-harder negatives for UCL
  • Theorem 1: Hard-negative CL versus CL losses
  • Lemma 1: Harris inequality (Theorem 2.15 in Boucheron, Lugosi, and Massart, 2013)
  • Corollary 1
  • Theorem 2: Lower bound for SCL loss and conditions for equality with unit-ball representations and equiprobable classes
  • Remark 1
  • Remark 2
  • ...and 6 more