Simulation-based inference with deep ensembles: Evaluating calibration uncertainty and detecting model misspecification

James Alvey, Carlo R. Contaldi, Mauro Pieroni

TL;DR

This paper addresses the challenge of validating SBI posteriors without access to the true posterior by proposing an ensemble-based KL-divergence diagnostic. By training multiple SBI estimators on the same simulations and computing the pairwise KL divergences between their posteriors, the KL divergence matrix quantifies ensemble consistency and highlights potential issues from undertraining to model misspecification. The authors connect the KL matrix to systematic training uncertainty, demonstrate its behavior on SBI benchmarks, and show how misfit observations lead to increased ensemble disagreement. This approach provides a scalable, model-agnostic tool to increase the reliability and interpretability of SBI results in scientific applications, with clear pathways for extension to other divergences and calibration techniques.

Abstract

Simulation-Based Inference (SBI) offers a principled and flexible framework for conducting Bayesian inference in any situation where forward simulations are feasible. However, validating the accuracy and reliability of the inferred posteriors remains a persistent challenge. In this work, we point out a simple diagnostic approach rooted in ensemble learning methods to assess the internal consistency of SBI outputs that does not require access to the true posterior. By training multiple neural estimators under identical conditions and evaluating their pairwise Kullback-Leibler (KL) divergences, we define a consistency criterion that quantifies agreement across the ensemble. We highlight two core use cases for this framework: a) for generating a robust estimate of the systematic uncertainty in parameter reconstruction associated with the training procedure, and b) for detecting possible model misspecification when using trained estimators on real data. We also demonstrate the relationship between significant KL divergences and issues such as insufficient convergence due to, e.g., too low a simulation budget, or intrinsic variance in the training process. Overall, this ensemble-based diagnostic framework provides a lightweight, scalable, and model-agnostic tool for enhancing the trustworthiness of SBI in scientific applications.
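
As a rough illustration of the pairwise comparison described in the abstract, the sketch below estimates a KL matrix by Monte Carlo for a small ensemble. It is not the authors' implementation: the ensemble members are hypothetical 1-d Gaussian stand-ins for trained posterior estimators, and the names `gaussian_member` and `kl_matrix` are invented for this example. It only assumes that each member can draw samples and evaluate its log-density.

```python
import numpy as np


def gaussian_member(mean, std, rng):
    """Hypothetical stand-in for one trained posterior estimator.

    Returns a (sample, log_prob) pair, the only interface the
    KL estimate below relies on.
    """
    def sample(n):
        return rng.normal(mean, std, size=n)

    def log_prob(theta):
        return -0.5 * ((theta - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

    return sample, log_prob


def kl_matrix(members, n_samples=20_000):
    """Monte Carlo estimate of KL(p_i || p_j) for every ordered pair (i, j)."""
    n = len(members)
    kl = np.zeros((n, n))
    for i, (sample_i, log_prob_i) in enumerate(members):
        theta = sample_i(n_samples)      # draws from member i
        lp_i = log_prob_i(theta)
        for j, (_, log_prob_j) in enumerate(members):
            if i != j:
                # E_{p_i}[log p_i - log p_j], averaged over the draws
                kl[i, j] = np.mean(lp_i - log_prob_j(theta))
    return kl


rng = np.random.default_rng(0)
# Two members that nearly agree and one outlier (illustrative values only).
members = [gaussian_member(m, s, rng) for m, s in [(0.0, 1.0), (0.05, 1.02), (0.5, 1.3)]]
kl = kl_matrix(members)
print(np.round(kl, 4))
```

In this toy setting the entries involving the third member are much larger than those between the first two, which is the kind of ensemble disagreement the diagnostic is meant to surface. The matrix is not symmetric in general, since the KL divergence itself is not.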

Paper Structure

This paper contains 12 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: Interpreting the KL divergence: A set of examples illustrating quantitatively how the KL divergence depends on the differences in the means/variances of 1-d Normal distributions. Here, the base distribution $\mathcal{N}_1$ has $\mu_1=0$ and $\sigma_1=1$, while the contrasting distribution $\mathcal{N}_2$ has parameters $\mu_2 = \gamma$ and $\sigma_2=1+\epsilon$. Left: Contour plot showing the value of the KL divergence for different values of $\epsilon$ and $\gamma$. Also shown (in red) is the quadratic approximation derived in Eq. \ref{eq:KL_approx}, which we see matches well for small divergences. The coloured stars correspond to specific values reported in the legend. Right: 1-d histograms showing the PDFs for the values of $\epsilon$ and $\gamma$ corresponding to the stars in the left plot.
  • Figure 2: KL matrix as a training diagnostic. Left: Mean of the KL matrix as a function of the number of training epochs for different training set sizes (corresponding to bottom legend). Middle: Value of the KL divergences in the ensemble (averaged over the last 30 epochs) at the end of the training as a function of training dataset size. In dashed grey, we show the expected $1/n_{\rm Train}$ behaviour before saturating at the intrinsic variance floor. Right: Value of the loss function at the end of the training process as a function of the training dataset size.
  • Figure 3: Evolution of the posteriors. True posterior distribution (black) compared to samples drawn from the different networks in the ensembles. The different panels (top left to bottom right) correspond to different training dataset sizes (colour code as in Fig. \ref{fig:KL_GM_all}).
  • Figure 4: Detecting Model Misspecification. Left: Posteriors (bottom plot) obtained from networks conditioned on an observation (top plot) compatible with the data used during the training. In the top plot, the data realisation is shown in black, and the dashed orange line represents the input theoretical model $\mu_k$. The orange vertical band in the bottom plot shows the maximal expected spread among estimates given the final value of the KL matrix ($\sim 7 \times 10^{-2}$) during training. As expected, given the small KL at the end of the training, all posteriors agree. For reference, we show the true posterior in black. The dashed grey line represents the injected value of $\theta$ and the orange vertical line the mean value inferred from the data. Right: Posteriors (bottom plot) from the same networks conditioned on an observation (top plot) that is not compatible with the data used during the training. Once the networks are conditioned on an observation that does not match the model used in the training process, the spread in the posteriors increases significantly, which we can easily detect using the KL matrix. The orange solid and grey dashed lines indicate the same quantities as in the left-hand figure.
  • Figure 5: Tracking the amount of model misspecification. Left: Three examples of the injection templates $\mu_k = \theta + a \mathcal{B}_k$, corresponding to the dashed vertical lines in the right-hand figure. The error bars correspond to the $1\sigma$ noise levels in each bin, and the horizontal dashed lines indicate the equivalent flat template with $a = 0$. Right: The elements of the KL matrix as a function of the amplitude of the bump. The grey bands are the percentiles computed among networks in the regime where the bump contribution is negligible compared to the flat component. The remaining curves indicate the behaviour of the KL divergence associated with each network combination. The mean (light blue) and maximum (orange) curves are specifically highlighted to be used as a diagnostic for detecting model misspecification.
  • ...and 2 more figures
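
The 1-d Normal setup in the Figure 1 caption can be reproduced with the standard closed-form KL divergence between two Gaussians. The sketch below is an illustration under assumptions, not the paper's code: it takes the direction $\mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2)$, which the caption does not specify, and the quadratic form $\gamma^2/2 + \epsilon^2$ is obtained here by Taylor-expanding the closed form to second order, so it may differ in presentation from the paper's own approximation. The function names are hypothetical.

```python
import math


def kl_normal(gamma, eps):
    """Exact KL( N(0, 1) || N(gamma, (1 + eps)^2) ).

    Uses the general closed form
    log(s2 / s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
    with m1 = 0, s1 = 1, m2 = gamma, s2 = 1 + eps.
    """
    sigma2 = 1.0 + eps
    return math.log(sigma2) + (1.0 + gamma**2) / (2.0 * sigma2**2) - 0.5


def kl_quadratic(gamma, eps):
    """Second-order expansion of kl_normal around gamma = eps = 0."""
    return 0.5 * gamma**2 + eps**2


for gamma, eps in [(0.1, 0.05), (0.5, 0.3), (1.0, 0.5)]:
    exact = kl_normal(gamma, eps)
    approx = kl_quadratic(gamma, eps)
    print(f"gamma={gamma:.2f} eps={eps:.2f} exact={exact:.4f} quadratic={approx:.4f}")
```

For small shifts the two values agree closely, and they drift apart as $\gamma$ and $\epsilon$ grow, consistent with the caption's remark that the approximation matches well for small divergences.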