InfoNCE: Identifying the Gap Between Theory and Practice

Evgenia Rusak, Patrik Reizinger, Attila Juhos, Oliver Bringmann, Roland S. Zimmermann, Wieland Brendel

TL;DR

This paper exposes a gap between the isotropic identifiability guarantees of InfoNCE and the anisotropic latent changes induced by practical data augmentations. It introduces AnInfoNCE, a generalized loss that models anisotropy with a diagonal scaling, and proves identifiability of latent factors up to a block-orthogonal transformation. Through synthetic, MNIST-VAE, and real-image experiments, it shows that AnInfoNCE improves latent-factor recovery but can trade off downstream classification accuracy, and it extends the theory to hard negative mining and ensemble losses. The work highlights remaining mismatches between theory and practice and points to future directions for closing the gap between identifiable representations and utility across tasks.

Abstract

Prior theory work on Contrastive Learning via the InfoNCE loss showed that, under certain assumptions, the learned representations recover the ground-truth latent factors. We argue that these theories overlook crucial aspects of how CL is deployed in practice. Specifically, they either assume equal variance across all latents or that certain latents are kept invariant. However, in practice, positive pairs are often generated using augmentations such as strong cropping to just a few pixels. Hence, a more realistic assumption is that all latent factors change with a continuum of variability across all factors. We introduce AnInfoNCE, a generalization of InfoNCE that can provably uncover the latent factors in this anisotropic setting, broadly generalizing previous identifiability results in CL. We validate our identifiability results in controlled experiments and show that AnInfoNCE increases the recovery of previously collapsed information in CIFAR10 and ImageNet, albeit at the cost of downstream accuracy. Finally, we discuss the remaining mismatches between theoretical assumptions and practical implementations.
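
The idea of modeling anisotropy with a diagonal scaling can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the similarity is a negative squared distance weighted by a learnable diagonal, and that negatives are the other samples in the batch. The function name `aninfonce_loss` and the `log_lambda` parametrization are hypothetical, and the excerpt does not reproduce the paper's exact loss.

```python
import numpy as np


def aninfonce_loss(anchors, positives, log_lambda):
    """Anisotropic InfoNCE-style loss on a batch (illustrative sketch).

    anchors, positives: arrays of shape (batch, dim); row i of `positives`
    is the positive partner of row i of `anchors`, and all other rows act
    as negatives (an assumption of this sketch).
    log_lambda: array of shape (dim,), log of the diagonal weights.
    """
    lam = np.exp(log_lambda)  # exponentiating keeps the weights positive
    # Pairwise differences between every anchor and every candidate,
    # shape (batch, batch, dim).
    diff = anchors[:, None, :] - positives[None, :, :]
    # Weighted squared distance: dist[i, j] = sum_k lam[k] * diff[i, j, k]**2.
    dist = np.einsum("ijk,k->ij", diff ** 2, lam)
    logits = -dist
    # Numerically stable log-softmax over the candidates of each anchor.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The true positive pair sits on the diagonal.
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
z_pos = z + 0.1 * rng.normal(size=(8, 4))
# All weights equal to 1: the isotropic special case.
loss_iso = aninfonce_loss(z, z_pos, np.zeros(4))
# Two groups of dimensions with different weights: the anisotropic case.
loss_aniso = aninfonce_loss(z, z_pos, np.log([5.0, 5.0, 25.0, 25.0]))
```

With all entries of `log_lambda` equal, the weighting is the same in every dimension, which corresponds to the isotropic setting that earlier theory assumes. In training, `log_lambda` would be optimized jointly with the encoder.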

Paper Structure

This paper contains 63 sections, 8 theorems, 48 equations, 13 figures, 3 tables.

Key Result

Theorem 3.1

Under the assumptions labeled assum:non_isotropic_cl, if a pair $(f, \hat{\Lambda})$ minimizes the loss in eq:weightedcl, then $f \circ g$ is a block-orthogonal transformation, where each block acts on latents with equal weight $\Lambda_{ii}$. In other words, the ground-truth latents are identified up to a block-orthogonal transformation.
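
One way to build intuition for why the guarantee stops at block-orthogonal transformations is a small numerical check. The sketch below assumes the loss sees the latents only through a squared distance weighted by the diagonal weights (an assumption; the excerpt does not reproduce eq:weightedcl). The helper names are hypothetical, and the weights 5 and 25 are borrowed from the Figure 2 caption purely for illustration.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q


def block_orthogonal(lam, rng):
    """Orthogonal matrix that only mixes dimensions sharing the same weight."""
    lam = np.asarray(lam, dtype=float)
    q = np.zeros((lam.size, lam.size))
    for value in np.unique(lam):
        idx = np.flatnonzero(lam == value)
        # Fill the sub-block for this group of equal-weight dimensions.
        q[np.ix_(idx, idx)] = random_orthogonal(idx.size, rng)
    return q


rng = np.random.default_rng(0)
lam = np.array([5.0, 5.0, 25.0, 25.0, 25.0])  # two groups of equal weights
Q = block_orthogonal(lam, rng)
L = np.diag(lam)

# Mixing within equal-weight blocks leaves the weighted distance unchanged,
# because Q.T @ L @ Q reproduces L exactly.
within_blocks_ok = np.allclose(Q.T @ L @ Q, L)

# A generic rotation that mixes dimensions across blocks changes it.
R = random_orthogonal(lam.size, rng)
across_blocks_ok = np.allclose(R.T @ L @ R, L)
```

Here `within_blocks_ok` is true and `across_blocks_ok` is false: a weighted distance cannot tell apart rotations inside a group of equally weighted latents, which matches the ambiguity the theorem allows.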

Figures (13)

  • Figure 1: Illustration of the Mismatch Between the Standard Model and Practice. A: CL with the commonly used InfoNCE objective is identifiable when all latents change to the same extent across the positive pair (Zimmermann et al., 2021), which is unlikely to happen in practice. B: The more likely scenario, in which augmentations affect different latents to different extents, leads to dimensional collapse and information loss. C: Our proposed objective, AnInfoNCE, models features that can vary to a different degree in the positive pair, avoiding collapse.
  • Figure 2: Behavior of AnInfoNCE on Synthetic Data ($\Lambda_1=5$). A: We maintain high $R^2$ scores with AnInfoNCE for both content and style dimensions, while the style dimensions are lost when training with the regular InfoNCE loss. For $\Lambda_2=25$ (dotted black vertical line), we show: B: the evolution of the linear $R^2$ scores during training; C: AnInfoNCE reaches the global minimum, computed based on ground-truth latents; D: the evolution of the learned $\hat{\Lambda}$ values.
  • Figure 3: Behavior of AnInfoNCE on MNIST. A: Linear identifiability ($R^2$) scores for an encoder trained on VAE-generated MNIST samples, when varying $\Lambda_2$ and keeping $\Lambda_1 = 5$ for the positive conditional. Style dimensions collapse ($R^2=0$) for regular InfoNCE. B: KNN accuracy evaluated on the regular MNIST dataset using the encoders trained as described in A. KNN accuracy degrades when style dimensions are lost. C & D: The evolution of the learned $\hat{\Lambda}$ during training. We set the diagonal entries of $\Lambda$ to ten different values by linearly interpolating between $5$ and $50$ (C) or $5$ and $200$ (D). The ground-truth $\Lambda$ values are indicated by dashed lines. The learned $\hat{\Lambda}$ are shown in the corresponding colors as solid lines.
  • Figure 4: Latent Dimensionality, Concentration, and Batch Size Influence Identifiability. Linear identifiability, quantified by the $R^2$ score between reconstructed and ground-truth latents, degrades with A: higher latent dimensionality $d$; and B: a larger concentration parameter in the ground-truth positive conditional. Both detrimental effects can be countered by increasing the batch size.
  • Figure 5: Concentration of Positive Pairs Influences Augmentation Overlap. We figuratively visualize conditional distributions with large (A) and small (B) concentration parameters. For large concentration values, samples from the conditional distribution of two anchor points do not overlap, signifying missing augmentation overlap. This is not the case for small values.
  • ...and 8 more figures
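
The role of the per-dimension concentration in the captions above can be illustrated with a toy sampler. This is a rough sketch under stated assumptions, not the paper's data-generating process: positives are produced by adding Gaussian noise with per-dimension scale 1 / sqrt(2 * lam_i) and projecting back onto the unit sphere. The function `sample_positive_pairs` is hypothetical, and the range 5 to 50 is taken from the Figure 3 caption.

```python
import numpy as np


def sample_positive_pairs(n, lam, rng):
    """Draw anchors on the unit sphere and anisotropically perturbed positives."""
    lam = np.asarray(lam, dtype=float)
    z = rng.normal(size=(n, lam.size))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    # Per-dimension noise scale 1 / sqrt(2 * lam_i): a higher concentration
    # means a smaller change. This Gaussian stand-in is an assumption.
    noise = rng.normal(size=z.shape) / np.sqrt(2.0 * lam)
    z_pos = z + noise
    z_pos /= np.linalg.norm(z_pos, axis=1, keepdims=True)
    return z, z_pos


rng = np.random.default_rng(0)
lam = np.linspace(5.0, 50.0, 10)  # ten values interpolated between 5 and 50
z, z_pos = sample_positive_pairs(20000, lam, rng)

# Average squared change of each latent dimension across the positive pair.
mean_sq_change = np.mean((z_pos - z) ** 2, axis=0)
```

In this toy setup the first dimension (lowest concentration) changes far more across the positive pair than the last one, which is the kind of anisotropy the captions describe.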

Theorems & Definitions (30)

  • Theorem 3.1: Identifiability of Anisotropic InfoNCE
  • Proof: Proof sketch (full proof in subsec:ident_base)
  • Corollary 3.1: Identifiability with hard negative (HN) mining (proof in subsec:theory_hard_negatives)
  • Corollary 3.2: Identifiability of ensemble AnInfoNCE
  • Definition A.1: Partitions of
  • Definition A.2: Content–style partitioning of
  • Remark B.1
  • Definition B.1: General Contrastive Learning (CL) Problem
  • Remark B.2
  • Theorem B.1: Bayes-Optima of CL
  • ...and 20 more