InfoNCE Induces Gaussian Distribution

Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

TL;DR

This work shows that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training, and establishes this result in two complementary regimes to provide a principled explanation for commonly observed Gaussianity in contrastive representations.

Abstract

Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
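For orientation, the InfoNCE objective discussed above can be sketched in a few lines of NumPy. This is a minimal, one-directional variant under stated assumptions, not the authors' implementation: the function name `info_nce`, the temperature value `0.1`, and the use of in-batch negatives are illustrative choices that do not come from the paper.

```python
import numpy as np

def info_nce(u, v, temperature=0.1):
    """One-directional InfoNCE loss for a batch of paired views.

    u, v: arrays of shape (N, d); row i of u and row i of v form a positive
    pair, and all other rows of v act as negatives for row i of u.
    The temperature default is an illustrative assumption.
    """
    # L2-normalize so that representations lie on the unit sphere S^{d-1}.
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)

    # (N, N) matrix of cosine similarities scaled by 1 / temperature.
    logits = (u @ v.T) / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
base = rng.standard_normal((64, 16))
loss_aligned = info_nce(base, base + 0.01 * rng.standard_normal((64, 16)))
loss_random = info_nce(base, rng.standard_normal((64, 16)))
```

In this toy setup the nearly identical views should give a much lower loss than the unrelated ones, which is the alignment pressure that the analysis builds on.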

Paper Structure

This paper contains 43 sections, 10 theorems, 52 equations, 11 figures, 2 tables.

Key Result

Proposition 1

Let $X, Y \sim \mathcal{A}(\cdot \mid X_0)$ be conditionally independent given the base sample $X_0$, and let $u = \hat{f}(X)$, $v = \hat{f}(Y)$ be normalized representations in $\mathbb{S}^{d-1}$, i.e., $\|u\| = \|v\| = 1$. Then where $\eta_2 = \rho_m^2(X, X_0)$ is the squared HGR maximal correlation between the view and the base, and $\mu$ is the marginal law of $u$.
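For readers unfamiliar with the quantity $\rho_m$ used above, the standard textbook definition of the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation is recalled here. This is the general definition, stated as background; the paper's own notation or normalization may differ.

```latex
\rho_m(X, X_0) \;=\; \sup_{f,\, g} \; \mathbb{E}\big[ f(X)\, g(X_0) \big]
\quad \text{subject to} \quad
\mathbb{E}[f(X)] = \mathbb{E}[g(X_0)] = 0, \qquad
\mathbb{E}[f(X)^2] = \mathbb{E}[g(X_0)^2] = 1 ,
```

where the supremum runs over measurable real-valued functions $f$ and $g$. It satisfies $0 \le \rho_m \le 1$, and $\rho_m = 0$ exactly when $X$ and $X_0$ are independent.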

Figures (11)

  • Figure 1: Illustration. Contrastive learning yields (approximately) Gaussian representations.
  • Figure 2: Uniformity vs. alignment across settings. A simple linear encoder trained on synthetic Laplace data exhibits (i) near-optimal alignment across all configurations and (ii) steadily improving uniformity as batch size or dimensionality grow.
  • Figure 3: Synthetic data experiments. Left: representation norm statistics vs. batch size (curves denote dimension), showing thin-shell concentration with increasing $d$ and $N$. Top middle/right: norm histograms illustrating radius tightening. Bottom: normality diagnostics (AD, DP), with averages in the Gaussian acceptance range.
  • Figure 4: CIFAR-10 training dynamics. A two-layer MLP trained with InfoNCE on CIFAR-10 exhibits increasing Gaussianity over training. Left: representation norms concentrate as indicated by declining CV (Eq. \ref{eq:cv-radius}). Middle: the AD statistic decreases from non-Gaussian levels into the normal range. Right: the fraction of coordinates passing the DP normality test rises steadily.
  • Figure 5: vMF exponential tilt distribution for different concentration scales $\kappa$.
  • ...and 6 more figures
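
The diagnostics named in the captions above (norm concentration via the coefficient of variation, and per-coordinate normality tests) can be approximated with a short sketch. It assumes that "CV" is the standard deviation of the representation norms divided by their mean and that "DP" refers to the D'Agostino-Pearson test, for which `scipy.stats.normaltest` is used here; the helper names, the significance level `0.05`, and the synthetic inputs are illustrative and do not come from the paper.

```python
import numpy as np
from scipy import stats

def norm_cv(z):
    """Coefficient of variation of row norms: std(||z||) / mean(||z||).

    Smaller values mean the representations sit in a thinner shell.
    """
    norms = np.linalg.norm(z, axis=1)
    return float(norms.std() / norms.mean())

def dp_pass_fraction(z, alpha=0.05):
    """Fraction of coordinates for which normality is NOT rejected.

    Runs the D'Agostino-Pearson test on each column of z;
    alpha is an assumed significance level.
    """
    _, pvalues = stats.normaltest(z, axis=0)
    return float(np.mean(pvalues > alpha))


# Synthetic stand-ins for learned representations (illustrative only).
rng = np.random.default_rng(0)
gauss = rng.standard_normal((2000, 64))
laplace = rng.laplace(size=(2000, 64))

cv_gauss, cv_laplace = norm_cv(gauss), norm_cv(laplace)
pass_gauss, pass_laplace = dp_pass_fraction(gauss), dp_pass_fraction(laplace)
```

On these synthetic inputs, the Gaussian sample should show the smaller norm CV and the higher pass fraction, while the heavy-tailed Laplace sample should be rejected on almost every coordinate.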

Theorems & Definitions (16)

  • Proposition 1: Augmentation-controlled alignment bound
  • Corollary 1: Gaussian $k$-projections at the plateau
  • Proposition 2: Gaussian projections for unnormalized representations
  • Proposition 3
  • Definition 1
  • Lemma 1
  • Theorem 1
  • proof
  • Corollary 2
  • proof
  • ...and 6 more