Contrastive Neural Ratio Estimation for Simulation-based Inference

Benjamin Kurt Miller; Christoph Weniger; Patrick Forré

TL;DR

This work introduces nre-c, a generalization of likelihood-to-evidence ratio estimation for simulation-based inference that avoids the bias inherent to the multiclass NRE-B setup. By adding an independent class and carefully designing the loss, nre-c eliminates the $c_{w}(x)$ bias term at optimum and recovers NRE-A and NRE-B in corner cases, which keeps the usual diagnostics informative. The likelihood-to-evidence ratio $r(\boldsymbol{x} \, | \, \boldsymbol{\theta})$ is estimated via a multiclass classifier whose bias-free optimum simplifies normalization. The authors propose mutual-information bounds and an importance-sampling diagnostic to assess ratio quality, and they evaluate three training regimes: unlimited data, fast prior drawing, and the sbibm benchmark. The most accurate posteriors and most robust diagnostics are reported for $K>1$ and $\gamma\approx1$. They also show that normalizing the posterior is feasible within finite data regimes, and that mutual-information-based metrics can guide model selection without ground-truth posteriors. Overall, nre-c offers a principled, diagnostically verifiable, and scalable approach for amortized SBI across diverse data regimes.

Abstract

Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest two bounds on the mutual information as performance metrics for simulation-based inference methods, without the need for posterior samples, and provide experimental results. This version corrects a minor implementation error in $\gamma$, improving results.
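
The abstract's idea of scoring a ratio estimator through the mutual information, without posterior samples, can be illustrated on a toy problem. The sketch below is not the paper's pair of bounds, which this summary does not define. It assumes the simplest related quantity, the average log ratio over jointly drawn $(\boldsymbol{\theta}, \boldsymbol{x})$ pairs, and a hypothetical Gaussian model with $\boldsymbol{\theta} \sim \mathcal{N}(0, 1)$ and $\boldsymbol{x} \, | \, \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, 1)$, for which the ratio and the mutual information $\tfrac{1}{2}\log 2$ are known in closed form. All function names are illustrative.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_ratio(theta, x):
    """Exact log r(x | theta) = log p(x | theta) - log p(x) for the toy model."""
    # Assumed toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1), so p(x) = N(0, 2).
    return log_normal_pdf(x, theta, 1.0) - log_normal_pdf(x, 0.0, 2.0)


def shrunk_log_ratio(theta, x):
    """A deliberately degraded (hypothetical) estimator: half the exact log ratio."""
    return 0.5 * log_ratio(theta, x)


def average_log_ratio(pairs, log_ratio_fn):
    """Monte Carlo average of a log ratio over jointly drawn (theta, x) pairs."""
    return sum(log_ratio_fn(theta, x) for theta, x in pairs) / len(pairs)


rng = random.Random(0)
pairs = []
for _ in range(100_000):
    theta = rng.gauss(0.0, 1.0)   # prior draw
    x = rng.gauss(theta, 1.0)     # simulator draw given theta
    pairs.append((theta, x))

mi_exact = 0.5 * math.log(2.0)                  # closed form for this toy model
mi_estimate = average_log_ratio(pairs, log_ratio)
mi_degraded = average_log_ratio(pairs, shrunk_log_ratio)

print(f"exact MI           : {mi_exact:.4f}")
print(f"exact-ratio score  : {mi_estimate:.4f}")
print(f"degraded score     : {mi_degraded:.4f}")
```

In this toy case the exact ratio scores close to the true mutual information and the degraded estimator scores lower, which is the sense in which such a metric could rank models using only simulated pairs.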

Paper Structure (44 sections, 1 theorem, 48 equations, 14 figures, 4 tables)

Introduction
Contribution
Methods
nre-a
nre-b
Contrastive Neural Ratio Estimation
Optimization
Recovering nre-a and nre-b
Estimating a normalized posterior
Measuring performance & ratio estimator diagnostics
Comparing to a tractable posterior with estimates of exactness
Importance sampling diagnostic
Mutual information bound
Empirical expected coverage probability
Experiments
...and 29 more sections

Key Result

Lemma 1

Consider for $k=0,\dots,K$ the following probability distributions for $\boldsymbol{z}$: […] and $p(y) > 0$ a probability distribution for $y$. Put $p_k:=p(y=k)$ for $k=1,\dots,K$. For functions $f_k:\, \mathcal{Z} \to \mathbb{R}$, $k=1,\dots,K$, let: […] Note that $q(y = k \, | \, f, \boldsymbol{z}) > 0$ for all $k=0,\dots,K$ and that $\sum_{k=0}^K q(y = k \, | \, f, \boldsymbol{z}) =1$ for every $K$-tuple […]
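
The two properties noted in the lemma, positivity and normalization over the $K+1$ classes, can be checked numerically. The definition of $q$ is not reproduced in this excerpt, so the sketch below assumes a softmax-like form in which class $0$ carries weight $p_0$ and class $k \geq 1$ carries weight $p_k \exp f_k(\boldsymbol{z})$. That form, and the function name `class_probabilities`, are assumptions for illustration only.

```python
import math


def class_probabilities(logits, priors):
    """Return [q(y=0), q(y=1), ..., q(y=K)] for a single input z.

    logits: [f_1(z), ..., f_K(z)]
    priors: [p_0, p_1, ..., p_K], all strictly positive
    Assumed form: class 0 has weight p_0, class k >= 1 has weight p_k * exp(f_k(z)).
    """
    if len(priors) != len(logits) + 1:
        raise ValueError("need one prior per class, including class 0")
    # Work in log space; class 0 has an implicit logit of zero.
    log_terms = [math.log(priors[0])]
    log_terms += [math.log(p) + f for p, f in zip(priors[1:], logits)]
    shift = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(t - shift) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# K = 3 functions evaluated at one z, with a uniform class prior.
probs = class_probabilities(logits=[0.3, -1.2, 2.0], priors=[0.25, 0.25, 0.25, 0.25])
print(probs)
print(sum(probs))
```

Under this assumed form, the extra class $0$ needs no function of its own, so the classifier has $K$ outputs but $K+1$ classes.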

Figures (14)

Figure 1: Conceptual, interpolated map from investigated hyperparameters of proposed algorithm nre-c to a measurement of posterior exactness using the Classifier Two-Sample Test. Best 0.5, worst 1.0. Red dot indicates nre-a's hyperparameters, $\gamma = 1$ and $K = 1$ (Hermans2019). Purple line implies nre-b (Durkan2020) with $\gamma = \infty$ and $K \geq 1$. nre-c covers the entire plane, generalizing other methods. Best performance occurs with $K > 1$ and $\gamma \approx 1$, in contrast with the settings of existing algorithms.
Figure 2: Schematic depicting how the loss is computed in nre algorithms. ($\boldsymbol{\theta}$, $\boldsymbol{x}$) pairs are sampled from distributions at the top of the figure, entering the loss functions as depicted. nre-c controls the number of contrastive classes with $K$ and the weight of independent and dependent terms with $p_0$ and $p_K$. nre-c generalizes other algorithms. Hyperparameters recovering nre-a and nre-b are listed next to the name within the dashed areas. Notation details are defined in the section Contrastive Neural Ratio Estimation.
Figure 3: Visualization of the importance sampling diagnostic on ratio estimators trained using nre-b and nre-c. (a) Both methods produce satisfactory posterior estimates that agree with $p(\boldsymbol{\theta} \, | \, \boldsymbol{x})$. (b) $p(\boldsymbol{x} \, | \, \boldsymbol{\theta})$ is shown along with $p(\boldsymbol{x})$ samples weighted by nre-a $\exp \circ f_{\boldsymbol{w}}(\boldsymbol{\theta}, \boldsymbol{x})$ and nre-b $\exp \circ g_{\boldsymbol{w}}(\boldsymbol{\theta}, \boldsymbol{x})$. Each plot corresponds to a different $\boldsymbol{\theta}$. Despite high posterior accuracy, the nre-b estimates are distinct from $p(\boldsymbol{x} \, | \, \boldsymbol{\theta})$. (c) ROC curves of two classifiers, each trained to distinguish $p(\boldsymbol{x} \, | \, \boldsymbol{\theta})$ samples from $p(\boldsymbol{x})$ samples weighted by the corresponding nre's $\hat{r}$ estimate. The classifier failed to distinguish likelihood samples from the nre-c weighted data samples, but successfully identified nre-b weighted samples. nre-b accurately approximates the posterior, but fails the diagnostic. nre-c produces an accurate posterior surrogate and passes the diagnostic.
Figure 4: Exactness of nre posterior surrogates is computed for various contrastive parameter counts, $\gamma$ values, and architectures on an average of three tasks from the sbi benchmark sbibm. C2ST assigns 1.0 to inaccurate and 0.5 to accurate posterior approximation. (top) $p(\boldsymbol{\theta}, \boldsymbol{x})$ was sampled at every mini-batch during training. The accuracy strongly depends on $K$. (mid) A fixed number of dependent $(\boldsymbol{\theta}, \boldsymbol{x})$ pairs were drawn, but $p(\boldsymbol{\theta})$ was sampled at every mini-batch during training. In this regime, $K$ has a smaller effect. (bot) The training data is completely fixed. Contrastive parameters are drawn in a bootstrap from the mini-batch. On the problems with fixed simulation data $\boldsymbol{x}$, higher $K$ improves accuracy and small $\gamma$ with larger architectures slightly improves performance. The effects of the architecture are more clearly seen on difficult problems like SLCP in the experimental details appendix.
Figure 5: Our proposed metrics, a pair of bounds on the mutual information $-\hat{I}_{\boldsymbol{w}}^{(0)}(\boldsymbol{\theta}; \boldsymbol{x})$ (top) and $-\hat{I}_{\boldsymbol{w}}^{(1)}(\boldsymbol{\theta}; \boldsymbol{x})$ (bottom), for the SLCP task estimated over the validation set versus training epochs using (a) nre-b and (b) nre-c with various values of $\gamma$ and $K$, a Large NN architecture, and fixed training data. Recall, the i.i.d. estimates of $-\hat{I}_{\boldsymbol{w}}^{(0)}(\boldsymbol{\theta}; \boldsymbol{x})$ are biased and $-\hat{I}_{\boldsymbol{w}}^{(1)}(\boldsymbol{\theta}; \boldsymbol{x})$ are high variance. Our conclusions will be based on $-\hat{I}_{\boldsymbol{w}}^{(0)}(\boldsymbol{\theta}; \boldsymbol{x})$, which is the most readable. For both nre-b and nre-c, increasing $K$ tends to positively affect the convergence rate and optimal performance (unless $\gamma$ is very large). Increasing $\gamma$ increases convergence rate for a fixed $K$. Meanwhile, smaller $\gamma$ led to slightly better optima on this task. It's possible that the low values of $\gamma$ act as a regularizer, helping to generalize from the training data on this complex task and slowing convergence.
...and 9 more figures
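
The importance sampling diagnostic of Figure 3 rests on the identity $p(\boldsymbol{x} \, | \, \boldsymbol{\theta}) = r(\boldsymbol{x} \, | \, \boldsymbol{\theta}) \, p(\boldsymbol{x})$: samples from $p(\boldsymbol{x})$ reweighted by a correct ratio should look like samples from the likelihood. The sketch below is a minimal illustration under assumptions that are not from the paper: a hypothetical Gaussian toy model with a closed-form ratio, moment comparisons instead of a trained classifier, and an artificial bias $c(\boldsymbol{x}) = 0.5\,\boldsymbol{x}$ added to the log ratio to mimic an estimator that is only correct up to a function of $\boldsymbol{x}$.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_ratio(theta, x):
    """Exact log r(x | theta) = log p(x | theta) - log p(x) for the toy model."""
    # Assumed toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1), so p(x) = N(0, 2).
    return log_normal_pdf(x, theta, 1.0) - log_normal_pdf(x, 0.0, 2.0)


def weighted_moments(xs, weights):
    """Self-normalized importance sampling estimates of mean and variance."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
    return mean, var


rng = random.Random(0)
theta = 1.0
n = 200_000
xs = [rng.gauss(0.0, math.sqrt(2.0)) for _ in range(n)]  # draws from p(x)

# Unbiased ratio: weights average to one, weighted samples match p(x | theta) = N(1, 1).
weights = [math.exp(log_ratio(theta, x)) for x in xs]
mean_weight = sum(weights) / n
is_mean, is_var = weighted_moments(xs, weights)

# Ratio with an extra x-dependent factor exp(c(x)), c(x) = 0.5 * x (hypothetical bias).
biased_weights = [math.exp(log_ratio(theta, x) + 0.5 * x) for x in xs]
biased_mean_weight = sum(biased_weights) / n
biased_mean, biased_var = weighted_moments(xs, biased_weights)

print(f"unbiased: mean weight {mean_weight:.3f}, mean {is_mean:.3f}, var {is_var:.3f}")
print(f"biased  : mean weight {biased_mean_weight:.3f}, mean {biased_mean:.3f}, var {biased_var:.3f}")
```

A factor that depends only on $\boldsymbol{x}$ cancels when the posterior is normalized over $\boldsymbol{\theta}$ for a fixed observation, so the biased estimator in this toy case would still give the correct posterior while its weighted samples drift away from the likelihood, which matches the behavior the caption describes for nre-b.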

Theorems & Definitions (2)

Lemma 1
proof
