Contrastive Predictive Coding Done Right for Mutual Information Estimation

J. Jon Ryu, Pavan Yeddanapudi, Xiangxiang Xu, Gregory W. Wornell

TL;DR

This paper clarifies that the InfoNCE objective is not a direct mutual information (MI) estimator, but a variational bound on a generalization of the Jensen–Shannon divergence. It introduces InfoNCE-anchor, an anchor-based modification that enables consistent density-ratio estimation and a plug-in MI estimator within a framework defined by proper scoring rules. The approach unifies several contrastive objectives (including DV, NWJ, and $f$-divergence variants) and shows that the log score yields the best MI estimates. Anchor-based improvements do not always translate to better downstream self-supervised learning (SSL) performance, however: the benefits of contrastive learning arise from learning structured density ratios rather than exact MI values. Empirically, InfoNCE-anchor achieves state-of-the-art MI estimates across multiple domains and improves some downstream prediction tasks (e.g., protein interactions), while the SSL results suggest that pointwise MI (PMI) factorization and density-ratio structure are the key ingredients for learning useful representations. Overall, the work reframes contrastive learning from MI maximization to density-ratio factorization under a principled scoring-rule perspective.

Abstract

The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework. Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance. These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.

Paper Structure

This paper contains 38 sections, 17 theorems, 86 equations, 5 figures, 7 tables.

Key Result

Proposition 1

$\mathcal{D}_{\text{InfoNCE}}(\theta) \le \min\{\log K,\ D(q_1 \,\|\, q_0)\}$.
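
The $\log K$ cap in Proposition 1 can be seen numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment or its InfoNCE-anchor estimator: it takes $K$ to be the batch size, uses a correlated Gaussian pair whose density ratio is known in closed form, and plugs that exact ratio in as the critic rather than learning one. The function names (`log_density_ratio`, `infonce_value`, `demo`) are hypothetical. Even with the exact ratio, the batch InfoNCE value cannot exceed $\log K$, whereas averaging the same log ratio over paired samples (a plug-in estimate) tracks the true MI.

```python
import numpy as np


def log_density_ratio(x, y, rho):
    """Exact log p(x,y) / (p(x) p(y)) for a standard bivariate Gaussian
    with correlation rho (closed form; no learned critic involved)."""
    quad = rho**2 * x**2 - 2.0 * rho * x * y + rho**2 * y**2
    return -0.5 * np.log1p(-rho**2) - quad / (2.0 * (1.0 - rho**2))


def sample_pairs(n, rho, rng):
    """Draw n paired samples (x, y) with unit variances and correlation rho."""
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    return x, y


def infonce_value(scores):
    """Batch InfoNCE value: mean_i [s_ii - logsumexp_j s_ij] + log K.
    Assumption: K is the batch size and row i's positive is entry (i, i)."""
    K = scores.shape[0]
    row_max = scores.max(axis=1, keepdims=True)
    # Numerically stable log-sum-exp over each row.
    lse = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    return float(np.mean(np.diag(scores) - lse) + np.log(K))


def demo(rho=0.999, K=8, n_plug_in=100_000, seed=0):
    rng = np.random.default_rng(seed)

    # InfoNCE on a single batch of K pairs, scored with the exact log ratio.
    x, y = sample_pairs(K, rho, rng)
    scores = log_density_ratio(x[:, None], y[None, :], rho)  # K x K matrix

    # Plug-in estimate: average the same log ratio over paired samples.
    xs, ys = sample_pairs(n_plug_in, rho, rng)
    plug_in = float(np.mean(log_density_ratio(xs, ys, rho)))

    return {
        "infonce": infonce_value(scores),
        "log_K": float(np.log(K)),
        "plug_in": plug_in,
        "true_mi": float(-0.5 * np.log1p(-rho**2)),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name:>8s}: {value:.3f}")
```

With the defaults, the true MI is about 3.11 nats while $\log 8 \approx 2.08$, so the InfoNCE value necessarily falls short. The cap holds for every batch because each row's log-sum-exp includes the positive score itself, making each summand non-positive before $\log K$ is added.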

Figures (5)

  • Figure 1: Summary of MI estimation results on the standard benchmark. All experiments were done with batch size 64 and averaged over 20 random runs. Across all the cases, the proposed InfoNCE-anchor estimator (the rightmost column) consistently demonstrates low-bias, low-variance performance compared to the existing estimators. See the paper's section on the MI estimation experiments for the experiment setup.
  • Figure 2: Summary of the protein interaction prediction experiment.
  • Figure 3: Summary of MI estimation results on the standard benchmark on the Gaussian cubic data, with different batch sizes.
  • Figure 4: ROC curves from different estimators.
  • Figure 5: Histograms of pointwise MI ($\log\frac{p(x,y)}{p(x)p(y)}$) from different estimators.

Theorems & Definitions (32)

  • Proposition 1
  • Theorem 2
  • Theorem 3: Fisher consistency
  • Definition 4: Proper scoring rules
  • Proposition 5
  • Theorem 6
  • Theorem 7
  • Definition 8
  • Theorem 9
  • Lemma 10
  • ...and 22 more