Diversity Is All You Need for Contrastive Learning: Spectral Bounds on Gradient Magnitudes

Peter Ochieng

TL;DR

The paper develops a non-asymptotic spectral framework that bounds the squared InfoNCE gradient norm in contrastive learning, tying gradient magnitude to alignment, temperature, and the batch covariance spectrum through a spectral band. It introduces spectrum-aware batch selection and a greedy element-wise spectral builder that keep training within a moderate-diversity window, improving convergence speed while preserving accuracy. Experiments on synthetic data and large-scale benchmarks (ImageNet variants) show that the theoretical bounds track observed gradients, that whitening reduces gradient variance, and that Greedy-based batching yields notable speedups with modest overhead. The work bridges theory and practice by providing online diagnostics and practical batch-selection strategies that improve stability and efficiency in contrastive learning.

Abstract

We derive non-asymptotic spectral bands that bound the squared InfoNCE gradient norm via alignment, temperature, and batch spectrum, recovering the \(1/\tau^{2}\) law and closely tracking batch-mean gradients on synthetic data and ImageNet. Using effective rank \(R_{\mathrm{eff}}\) as an anisotropy proxy, we design spectrum-aware batch selection, including a fast greedy builder. On ImageNet-100, Greedy-64 cuts time-to-67.5% top-1 by 15% vs. random (24% vs. Pool–P3) at equal accuracy; CIFAR-10 shows similar gains. In-batch whitening promotes isotropy and reduces 50-step gradient variance by \(1.37\times\), matching our theoretical upper bound.
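To make the idea of effective rank as an anisotropy proxy concrete, the following is a minimal sketch and not the paper's implementation. It assumes the participation-ratio definition \((\operatorname{tr}\Sigma)^{2}/\operatorname{tr}(\Sigma^{2})\) of the batch covariance \(\Sigma\), which is one common choice; the exact definition of \(R_{\mathrm{eff}}\) used in the paper is not stated in this section. The function name `effective_rank` and the list-of-lists input format are hypothetical.

```python
def effective_rank(batch):
    """Participation-ratio effective rank (tr S)^2 / tr(S^2) of the batch covariance S.

    Assumption: this is one common definition, not necessarily the paper's R_eff.
    `batch` is a non-empty list of equal-length embedding vectors (lists of floats).
    """
    n = len(batch)
    dim = len(batch[0])
    # Center the embeddings so the result reflects covariance, not the batch mean.
    mean = [sum(x[k] for x in batch) / n for k in range(dim)]
    centered = [[x[k] - mean[k] for k in range(dim)] for x in batch]
    # The Gram matrix G = Xc Xc^T shares its nonzero eigenvalues with n * S,
    # so tr(G)^2 / ||G||_F^2 equals (tr S)^2 / tr(S^2); the factor n cancels.
    gram = [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]
    trace = sum(gram[i][i] for i in range(n))
    frob_sq = sum(g * g for row in gram for g in row)
    if frob_sq == 0.0:
        return 0.0  # all embeddings identical: no spread to measure
    return trace * trace / frob_sq


# An isotropic 2-D batch spreads variance evenly over both axes ...
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
# ... while a collapsed batch puts all variance on a single axis.
collapsed = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
r_iso = effective_rank(isotropic)
r_col = effective_rank(collapsed)
```

Under this definition the value lies between 1 (all variance along one direction) and the number of nonzero covariance eigenvalues (perfectly isotropic), so a spectrum-aware batch selector could in principle compare it against a target window when scoring candidate batches.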
