Hard-Negative Sampling for Contrastive Learning: Optimal Representation Geometry and Neural- vs Dimensional-Collapse
Ruijie Jiang, Thuan Nguyen, Shuchin Aeron, Prakash Ishwar
TL;DR
This work analyzes hard-negative sampling in contrastive learning under a general latent data model. Using the Harris inequality, it proves that, for any representation mapping, the HSCL and HUCL losses are lower bounded by the corresponding SCL and UCL losses. It shows that the globally optimal geometry for SCL/HSCL is Neural-Collapse (NC), in which class means form a normalized Equiangular Tight Frame (ETF) and within-class variance vanishes, without requiring class-conditional independence of augmented views; a parallel NC-optimality result is established for UCL when the embedding dimension exceeds the latent-class count. The authors extend these results to empirical and batched losses, discuss practical achievability through Adam optimization with unit-ball or unit-sphere normalization, and demonstrate that hard negatives mitigate Dimensional-Collapse (DC) in real-data experiments (CIFAR-10/100, Tiny ImageNet) while enabling convergence to NC under suitable conditions. They also show that, without hard negatives or normalization, representations tend toward DC, which underscores the role of both hard-negative sampling and normalization. The work closes with open questions about tight lower bounds for HUCL, latent-class collisions, and the training dynamics underpinning NC attainment, and the code is publicly available for replication.
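The comparison between hard and plain losses rests on the Harris inequality: if two functions are both nondecreasing, re-weighting samples by one of them cannot lower the average of the other. The following is a minimal numerical sketch of that idea, not the paper's proof. The exponential hardening weights, the exponential per-negative penalty, the hardness level `beta`, and the helper `tilted_mean` are all illustrative assumptions.

```python
import numpy as np

def tilted_mean(values, weights):
    """Weighted average of `values` under normalized `weights`."""
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))

rng = np.random.default_rng(0)
# Hypothetical similarities between one anchor and 10,000 negatives.
sims = rng.uniform(-1.0, 1.0, size=10_000)

beta = 2.0                 # hypothetical hardness level
eta = np.exp(beta * sims)  # nondecreasing hardening weights (assumed form)
penalty = np.exp(sims)     # nondecreasing per-negative penalty (assumed form)

plain = float(np.mean(penalty))   # negatives weighted uniformly
hard = tilted_mean(penalty, eta)  # negatives re-weighted toward hard ones

print(f"uniform negatives: {plain:.4f}")
print(f"hard negatives:    {hard:.4f}")
```

Because both `eta` and `penalty` increase with similarity, the re-weighted average is never smaller than the uniform one; setting `beta = 0.0` makes the two coincide.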
Abstract
For a widely-studied data model and general loss and sample-hardening functions we prove that the losses of Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) are minimized by representations that exhibit Neural-Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) losses are lower bounded by the corresponding SCL and UCL losses. In contrast to existing literature, our theoretical results for SCL do not require class-conditional independence of augmented views and work for a general loss function class that includes the widely used InfoNCE loss function. Moreover, our proofs are simpler, more compact, and more transparent. Similar to existing literature, our theoretical claims also hold for the practical scenario where batching is used for optimization. We empirically demonstrate, for the first time, that Adam optimization (with batching) of HSCL and HUCL losses with random initialization and suitable hardness levels can indeed converge to the NC-geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard-negatives or feature normalization, however, the representations learned via Adam suffer from Dimensional-Collapse (DC) and fail to attain the NC-geometry. These results exemplify the role of hard-negative sampling in contrastive representation learning, and we conclude with several open theoretical problems for future work. The code can be found at https://github.com/rjiang03/HCL/tree/main
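To make the NC geometry concrete, the sketch below builds class means that form an ETF and checks its defining properties. It assumes the standard simplex ETF from the Neural-Collapse literature (unit-norm means, pairwise inner product -1/(K-1), zero sum). The function name `simplex_etf` and the choice of K = 4 classes are illustrative and not taken from the paper's code.

```python
import numpy as np

def simplex_etf(num_classes):
    """Return a (K, K) array whose rows are unit vectors forming a simplex ETF."""
    k = num_classes
    # Subtract the common mean from the standard basis vectors.
    centered = np.eye(k) - np.ones((k, k)) / k
    # Rescale each row to unit norm.
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

means = simplex_etf(4)
gram = means @ means.T  # pairwise inner products between class means

# Diagonal entries are 1; every off-diagonal entry equals -1/(K-1).
print(np.round(gram, 3))
```

Under NC, every sample of a class would be mapped exactly onto its row of `means`, so the within-class variance is zero and the Gram matrix above fully describes the representation geometry.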
