On the connection between Noise-Contrastive Estimation and Contrastive Divergence

Amanda Olmin; Jakob Lindqvist; Lennart Svensson; Fredrik Lindsten

TL;DR

The paper addresses training unnormalised probabilistic models by connecting Noise-Contrastive Estimation (NCE) both to maximum likelihood with importance sampling (ML-IS) and to Contrastive Divergence (CD). It shows that ranking NCE (RNCE) is ML estimation with conditional importance sampling (CIS), and that RNCE and conditional NCE (CNCE) are special cases of CD. This places NCE within the ML/CD framework and lets techniques carry over between the two method classes. A key practical insight is that the noise proposal should resemble the model distribution ($q \approx p_{\theta}$). The authors propose adaptive proposals, persistent variants, a Metropolis-Hastings variant of CNCE (MH-CNCE), and a sequential Monte Carlo extension of RNCE (SMC-RNCE), each supported by theoretical arguments. Experiments on autoregressive energy-based models across multiple datasets report gains from RNCE, MH-CNCE, persistence, and SMC-RNCE.

Abstract

Noise-contrastive estimation (NCE) is a popular method for estimating unnormalised probabilistic models, such as energy-based models, which are effective for modelling complex data distributions. Unlike classical maximum likelihood (ML) estimation that relies on importance sampling (resulting in ML-IS) or MCMC (resulting in contrastive divergence, CD), NCE uses a proxy criterion to avoid the need for evaluating an often intractable normalisation constant. Despite apparent conceptual differences, we show that two NCE criteria, ranking NCE (RNCE) and conditional NCE (CNCE), can be viewed as ML estimation methods. Specifically, RNCE is equivalent to ML estimation combined with conditional importance sampling, and both RNCE and CNCE are special cases of CD. These findings bridge the gap between the two method classes and allow us to apply techniques from the ML-IS and CD literature to NCE, offering several advantageous extensions.
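For orientation, the quantities the abstract contrasts can be written out as follows. This is a sketch in assumed notation ($\tilde{p}_{\theta}$ for the unnormalised model and $Z_{\theta}$ for its normalisation constant); the identities are standard and are not copied from the paper's numbered equations.

```latex
\begin{align}
  p_{\theta}(\mathbf{x}) &= \frac{\tilde{p}_{\theta}(\mathbf{x})}{Z_{\theta}},
  \qquad Z_{\theta} = \int \tilde{p}_{\theta}(\mathbf{x}) \, \mathrm{d}\mathbf{x}, \\
  \nabla_{\theta} \log p_{\theta}(\mathbf{x})
  &= \nabla_{\theta} \log \tilde{p}_{\theta}(\mathbf{x})
   - \mathbb{E}_{\mathbf{x}' \sim p_{\theta}}
     \left[ \nabla_{\theta} \log \tilde{p}_{\theta}(\mathbf{x}') \right].
\end{align}
```

The expectation in the second line is the intractable part: ML-IS approximates it with importance sampling from a proposal $q$, while CD approximates it with a short MCMC run.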

Paper Structure

This paper contains 38 sections, 8 theorems, 57 equations, 4 figures, 3 tables, 3 algorithms.

Introduction
Background
Importance sampling
Contrastive divergence
Noise-contrastive estimation
Importance sampling and RNCE
Connecting NCE with CD
RNCE criterion
CNCE criterion
Insights from CD connection
Choice of proposal distribution $q$
Persistent NCE
MH variant of CNCE
Sequential Monte Carlo RNCE
Experiments
...and 23 more sections

Key Result

Proposition 3.1

RNCE is equivalent to ML estimation in which the normalisation constant in the log-likelihood of the unnormalised model is estimated using CIS, conditioning on $\mathbf{x}_0 \sim p_d(\cdot)$.
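The equivalence can be checked numerically on a toy problem. The sketch below assumes the usual ranking form of the RNCE criterion, $\log \left[ w(\mathbf{x}_0) / \sum_{j=0}^{J} w(\mathbf{x}_j) \right]$ with $w = \tilde{p}_{\theta} / q$, and a one-dimensional Gaussian-shaped model; the function names and the toy setup are illustrative assumptions, not the paper's code. It compares a finite-difference gradient of the RNCE criterion with the ML gradient in which the model expectation is replaced by a self-normalised importance-sampling estimate over the set that includes $\mathbf{x}_0$.

```python
import numpy as np

def log_p_tilde(x, theta):
    # Hypothetical unnormalised model: log p~_theta(x) = -(x - theta)^2 / 2.
    return -0.5 * (x - theta) ** 2

def log_q(x, scale=2.0):
    # Log-density of the (assumed) Gaussian proposal N(0, scale^2).
    return -0.5 * (x / scale) ** 2 - np.log(scale * np.sqrt(2.0 * np.pi))

def rnce_objective(theta, x0, noise):
    # Ranking criterion: log of the normalised weight of the data point x0
    # among {x0, noise_1, ..., noise_J}.
    xs = np.concatenate(([x0], noise))
    log_w = log_p_tilde(xs, theta) - log_q(xs)
    m = log_w.max()
    return log_w[0] - (m + np.log(np.sum(np.exp(log_w - m))))

def cis_ml_gradient(theta, x0, noise):
    # ML gradient: score at x0 minus an estimate of E_{p_theta}[score],
    # using self-normalised weights over the set that contains x0.
    xs = np.concatenate(([x0], noise))
    log_w = log_p_tilde(xs, theta) - log_q(xs)
    w = np.exp(log_w - log_w.max())
    w_norm = w / w.sum()
    score = xs - theta  # d/dtheta of log p~_theta(x) for this toy model
    return score[0] - np.sum(w_norm * score)

rng = np.random.default_rng(0)
theta, x0 = 0.3, 1.2
noise = rng.normal(0.0, 2.0, size=10)

eps = 1e-6
fd_grad = (rnce_objective(theta + eps, x0, noise)
           - rnce_objective(theta - eps, x0, noise)) / (2.0 * eps)
cis_grad = cis_ml_gradient(theta, x0, noise)
print(fd_grad, cis_grad)
```

The two printed values should agree up to finite-difference error, which illustrates the sense in which the RNCE gradient is an ML gradient with a conditional importance-sampling estimate of $\nabla_{\theta} \log Z_{\theta}$.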

Figures (4)

Figure 1: Left: Convergence of $p_{\theta}$ for different choices of proposal distribution $q$. Here, $q_{\varphi}$ is initialised at $p_d$ and we show the median divergence $\mathrm{KL} \left[ p_d \| p_{\theta} \right]$. The error bars mark the 25th and 75th percentiles, estimated from 20 repetitions. Middle and right: Results for ring model experiments, reported over training iterations as median (solid lines) and worst-case (dashed lines) estimated from 100 experiments. Middle: Squared parameter error of CNCE, CNCE with Metropolis--Hastings acceptance probability (MH-CNCE), persistent CNCE (P-CNCE) and persistent MH-CNCE (P-MH-CNCE). Right: Acceptance probability of (P-)CNCE and (P-)MH-CNCE when training with (P-)CNCE.
Figure 2: Results for ring model experiments with $N=200$ training data points. Results are reported over training iterations and as median (solid lines) and worst-case (dashed lines) estimated from 100 experiments. Left: Squared parameter error of standard CNCE, CNCE with Metropolis--Hastings acceptance probability (MH-CNCE), persistent CNCE (P-CNCE) and persistent MH-CNCE (P-MH-CNCE). Middle: Acceptance probability of (P-)CNCE and (P-)MH-CNCE when training with (P-)CNCE. Right: Acceptance probability of CNCE and MH-CNCE when training with (P-)MH-CNCE.
Figure 3: Results for ring model experiments with $N=1000$ training data points. Results are reported over training iterations and as median (solid lines) and worst-case (dashed lines) estimated from 100 experiments. Left: Squared parameter error of standard CNCE, CNCE with Metropolis--Hastings acceptance probability (MH-CNCE) and persistent CNCE (P-CNCE). Middle: Acceptance probability of CNCE and MH-CNCE when training with (P-)CNCE. Right: Acceptance probability of CNCE and MH-CNCE when training with (P-)MH-CNCE.
Figure: CIS kernel

Theorems & Definitions (9)

Proposition 3.1: RNCE is ML-CIS
Proposition 3.2: Unbiased CIS estimate of $\nabla_{\theta} \log Z_{\theta}$
Proposition 4.1: RNCE = CD-1
Proposition 4.2: CNCE = CD-1
Proposition 5.1: Gradient estimate for RNCE with $q = p_{\theta}$
Proposition 5.2: Gradient estimate for CNCE with $q = p_{\theta}$
Proposition 5.3: Unbiased CIS estimate of $\nabla_{\varphi} \mathcal{L}(\varphi)$
Lemma A.1: Unbiased general CIS estimate
proof
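The kind of correspondence named in Proposition 4.2 can be illustrated with a small numerical check. The sketch below assumes the standard single-pair CNCE criterion $\log \sigma(G)$ with $G = \log \left[ \tilde{p}_{\theta}(\mathbf{x}) q(\mathbf{x}' | \mathbf{x}) \right] - \log \left[ \tilde{p}_{\theta}(\mathbf{x}') q(\mathbf{x} | \mathbf{x}') \right]$, and a one-step kernel that moves from $\mathbf{x}$ to $\mathbf{x}'$ with a Barker-type acceptance probability $\sigma(-G)$. Both the kernel and all names are assumptions made for illustration; the paper's exact construction may differ.

```python
import numpy as np

def log_p_tilde(x, theta):
    # Hypothetical unnormalised model: log p~_theta(x) = -(x - theta)^2 / 2.
    return -0.5 * (x - theta) ** 2

def score(x, theta):
    # d/dtheta of log p~_theta(x) for the toy model above.
    return x - theta

def log_q_cond(a, b):
    # Assumed asymmetric conditional proposal q(a | b) = N(a; 0.5 * b, 1),
    # up to an additive constant that cancels in G.
    return -0.5 * (a - 0.5 * b) ** 2

def log_ratio(theta, x, x_prime):
    # G = log[p~(x) q(x'|x)] - log[p~(x') q(x|x')]
    return (log_p_tilde(x, theta) + log_q_cond(x_prime, x)
            - log_p_tilde(x_prime, theta) - log_q_cond(x, x_prime))

def cnce_objective(theta, x, x_prime):
    # log sigmoid(G), written with logaddexp for numerical stability.
    return -np.logaddexp(0.0, -log_ratio(theta, x, x_prime))

def cd1_gradient(theta, x, x_prime):
    # CD-1 gradient with the accept/reject step averaged out analytically:
    # score at the data minus the expected score after one kernel step.
    alpha = np.exp(-np.logaddexp(0.0, log_ratio(theta, x, x_prime)))  # sigmoid(-G)
    expected_next = alpha * score(x_prime, theta) + (1.0 - alpha) * score(x, theta)
    return score(x, theta) - expected_next, alpha

theta, x, x_prime = 0.3, 1.2, -0.4
eps = 1e-6
fd_grad = (cnce_objective(theta + eps, x, x_prime)
           - cnce_objective(theta - eps, x, x_prime)) / (2.0 * eps)
cd_grad, alpha = cd1_gradient(theta, x, x_prime)
print(fd_grad, cd_grad, alpha)
```

Under these assumptions the finite-difference gradient of the CNCE criterion matches the CD-1 gradient, and `alpha` is the acceptance probability of the implied kernel. Swapping the Barker-type probability for a Metropolis-Hastings one, $\min(1, e^{-G})$, is one plausible reading of what the MH-CNCE variant changes.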
