Bridging Mini-Batch and Asymptotic Analysis in Contrastive Learning: From InfoNCE to Kernel-Based Losses

Panagiotis Koromilas, Giorgos Bouritsas, Theodoros Giannakopoulos, Mihalis Nicolaou, Yannis Panagakis

TL;DR

The paper addresses why diverse contrastive learning losses sometimes diverge in practice and how they relate to hyperspherical energy minimisation (HEM). It proves that, under both minibatch and asymptotic regimes, InfoNCE-family losses share minimisers and are tied to energy minimisation on the sphere, then introduces Decoupled Hyperspherical Energy Loss (DHEL) to separate alignment from uniformity. It further extends the analysis to Kernel Contrastive Learning (KCL), showing batch-size independence of the expected loss and identifying non-asymptotic minimisers (e.g., regular simplices, cross-polytopes) under kernel assumptions. Empirical results across CIFAR-10/100, STL-10, and ImageNet-100 demonstrate improved downstream performance, robustness to batch size and hyperparameters, and reduced dimensionality collapse, underscoring the practical value of the proposed losses and the energy-centric perspective.
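The regular simplex mentioned above can be checked numerically. The following is a minimal sketch rather than the paper's code: the function names, the Gaussian kernel, and the value `t=2.0` are illustrative assumptions. It builds $M$ unit vectors whose pairwise inner products all equal $-1/(M-1)$ and compares their pairwise energy against a random configuration on the same sphere.

```python
import numpy as np

def regular_simplex(m):
    """Return m unit vectors in R^m that form a regular simplex centred at the origin."""
    pts = np.eye(m) - 1.0 / m  # centre the standard basis vectors
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def gaussian_energy(x, t=2.0):
    """Mean of exp(-t * ||x_i - x_j||^2) over all ordered pairs with i != j.

    The Gaussian kernel and t=2.0 are illustrative choices, not taken from the source.
    """
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    off_diag = ~np.eye(len(x), dtype=bool)
    return np.exp(-t * sq_dists[off_diag]).mean()

m = 8
simplex = regular_simplex(m)
gram = simplex @ simplex.T
off_diag = ~np.eye(m, dtype=bool)

# Random points on the same sphere, for comparison.
rng = np.random.default_rng(0)
random_pts = rng.normal(size=(m, m))
random_pts /= np.linalg.norm(random_pts, axis=1, keepdims=True)

energy_simplex = gaussian_energy(simplex)
energy_random = gaussian_energy(random_pts)
print("pairwise inner products equal -1/(m-1):", np.allclose(gram[off_diag], -1.0 / (m - 1)))
print("simplex energy:", energy_simplex, "random energy:", energy_random)
```

Because the kernel is a convex, increasing function of the inner product, the equal-angle configuration attains the lowest mean pairwise energy whenever the number of points is at most $d+1$, which is the setting this toy example uses.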

Abstract

What do different contrastive learning (CL) losses actually optimize for? Although multiple CL methods have demonstrated remarkable representation learning capabilities, the differences in their inner workings remain largely opaque. In this work, we analyse several CL families and prove that, under certain conditions, they admit the same minimisers when optimizing either their batch-level objectives or their expectations asymptotically. In both cases, an intimate connection with the hyperspherical energy minimisation (HEM) problem resurfaces. Drawing inspiration from this, we introduce a novel CL objective, coined Decoupled Hyperspherical Energy Loss (DHEL). DHEL simplifies the problem by decoupling the target hyperspherical energy from the alignment of positive examples while preserving the same theoretical guarantees. Going one step further, we show the same results hold for another relevant CL family, namely kernel contrastive learning (KCL), with the additional advantage of the expected loss being independent of batch size, thus identifying the minimisers in the non-asymptotic regime. Empirical results demonstrate improved downstream performance and robustness across combinations of different batch sizes and hyperparameters and reduced dimensionality collapse, on several computer vision datasets.
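To make the idea of decoupling concrete, the sketch below contrasts a one-directional InfoNCE loss with a loss whose uniformity term uses only within-view similarities. The exact DHEL formula is not stated in this summary, so `decoupled_loss`, the temperature `tau`, and the helper names are assumptions made for illustration, not the authors' definition.

```python
import numpy as np

def normalize(x):
    """Project each row onto the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def _logsumexp_rows(a, mask):
    """Row-wise log-sum-exp over the entries where mask is True."""
    a = np.where(mask, a, -np.inf)
    amax = a.max(axis=1, keepdims=True)
    return (amax + np.log(np.exp(a - amax).sum(axis=1, keepdims=True))).ravel()

def infonce(u, v, tau=0.5):
    """One-directional InfoNCE: positives and negatives both come from the u-v similarities."""
    sim = u @ v.T / tau
    all_pairs = np.ones(sim.shape, dtype=bool)
    return np.mean(-np.diag(sim) + _logsumexp_rows(sim, all_pairs))

def decoupled_loss(u, v, tau=0.5):
    """Hypothetical decoupled variant (an assumption, not the paper's stated formula).

    The alignment term uses only positive pairs (u_i, v_i); the uniformity term
    uses only within-view pairs (u_i, u_j) with j != i, so it does not depend on v.
    """
    align = -np.mean(np.sum(u * v, axis=1)) / tau
    off_diag = ~np.eye(len(u), dtype=bool)
    uniform = np.mean(_logsumexp_rows(u @ u.T / tau, off_diag))
    return align + uniform, align, uniform

rng = np.random.default_rng(0)
u = normalize(rng.normal(size=(16, 8)))
v = normalize(u + 0.1 * rng.normal(size=(16, 8)))  # noisy second "view" of u

total, align, uniform = decoupled_loss(u, v)
print("InfoNCE:", infonce(u, v))
print("decoupled total / alignment / uniformity:", total, align, uniform)
```

In this sketch, changing the second view `v` alters only the alignment term, while the uniformity term is a pure function of `u`. That separation is what "decoupling" refers to here.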

Paper Structure

This paper contains 36 sections, 13 theorems, 46 equations, 19 figures, and 2 tables.

Key Result

Theorem 4.1

Consider the following optimisation problem: [...] where $\mathbf{U}, \mathbf{V}$ are tuples of $M$ vectors on the unit $(d-1)$-sphere and $L_{\textnormal{CL-sym}}$ is the symmetric version of any of the loss functions $L_{\textnormal{a}}(\cdot, \cdot; \phi, \psi), L_{\textnormal{b}}(\cdot, \cdot; \phi, \psi)$ as defined in Eq. (eq:general_losses). Furt[...] Additionally, (4) if $\psi, \phi$ are strictly increa[...]

Theorems & Definitions (19)

  • Theorem 4.1
  • Corollary 4.2
  • Proposition 4.3
  • Theorem 5.1
  • Theorem 6.1
  • Proposition 6.2
  • Theorem 2.1
  • proof
  • Corollary 2.2
  • Corollary 2.3
  • ...and 9 more