Simulation-based inference with deep ensembles: Evaluating calibration uncertainty and detecting model misspecification
James Alvey, Carlo R. Contaldi, Mauro Pieroni
TL;DR
This paper addresses the challenge of validating simulation-based inference (SBI) posteriors without access to the true posterior by proposing an ensemble-based KL-divergence diagnostic. The authors train multiple SBI estimators on the same simulations and compute the pairwise KL divergences between their posteriors; the resulting KL divergence matrix quantifies ensemble consistency and flags potential issues ranging from undertraining to model misspecification. They connect the KL matrix to systematic training uncertainty, demonstrate its behavior on SBI benchmarks, and show that observations poorly described by the model lead to increased ensemble disagreement. The approach provides a scalable, model-agnostic tool for improving the reliability and interpretability of SBI results in scientific applications, with clear pathways for extension to other divergences and calibration techniques.
Abstract
Simulation-Based Inference (SBI) offers a principled and flexible framework for conducting Bayesian inference in any situation where forward simulations are feasible. However, validating the accuracy and reliability of the inferred posteriors remains a persistent challenge. In this work, we point out a simple diagnostic approach rooted in ensemble learning methods to assess the internal consistency of SBI outputs that does not require access to the true posterior. By training multiple neural estimators under identical conditions and evaluating their pairwise Kullback-Leibler (KL) divergences, we define a consistency criterion that quantifies agreement across the ensemble. We highlight two core use cases for this framework: a) for generating a robust estimate of the systematic uncertainty in parameter reconstruction associated with the training procedure, and b) for detecting possible model misspecification when using trained estimators on real data. We also demonstrate the relationship between significant KL divergences and issues such as insufficient convergence due to, e.g., too low a simulation budget, or intrinsic variance in the training process. Overall, this ensemble-based diagnostic framework provides a lightweight, scalable, and model-agnostic tool for enhancing the trustworthiness of SBI in scientific applications.
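The pairwise KL computation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `GaussianPosterior` class is a hypothetical stand-in for a trained neural posterior estimator, assumed only to expose `sample` and `log_prob`, and the function name `kl_matrix` and the sample count are illustrative choices. Each entry is a Monte Carlo estimate of KL(p_i || p_j), obtained by averaging log p_i(θ) − log p_j(θ) over samples θ drawn from p_i.

```python
import numpy as np


class GaussianPosterior:
    """Toy 1D Gaussian standing in for a trained SBI posterior (hypothetical)."""

    def __init__(self, mean, std):
        self.mean = float(mean)
        self.std = float(std)

    def sample(self, n, rng):
        # Draw n parameter samples from this posterior.
        return rng.normal(self.mean, self.std, size=n)

    def log_prob(self, theta):
        # Log-density of a normal distribution evaluated at theta.
        z = (theta - self.mean) / self.std
        return -0.5 * z**2 - np.log(self.std) - 0.5 * np.log(2.0 * np.pi)


def kl_matrix(posteriors, n_samples=20000, seed=0):
    """Monte Carlo estimate of KL(p_i || p_j) for every pair in the ensemble."""
    rng = np.random.default_rng(seed)
    k = len(posteriors)
    kl = np.zeros((k, k))
    for i, p in enumerate(posteriors):
        theta = p.sample(n_samples, rng)  # samples from ensemble member i
        log_p = p.log_prob(theta)
        for j, q in enumerate(posteriors):
            if i != j:
                kl[i, j] = np.mean(log_p - q.log_prob(theta))
    return kl


# Two members that nearly agree and one that is clearly offset.
ensemble = [
    GaussianPosterior(0.0, 1.0),
    GaussianPosterior(0.05, 1.0),
    GaussianPosterior(1.0, 1.0),
]
kl = kl_matrix(ensemble)
print(np.round(kl, 3))
```

In this toy setup the entries involving the third member are much larger than those between the first two, which is the kind of ensemble disagreement the diagnostic is meant to expose. The matrix is not symmetric in general, since KL(p_i || p_j) and KL(p_j || p_i) differ.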
