Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks: An Extended Investigation

Marvin Schmitt, Paul-Christian Bürkner, Ullrich Köthe, Stefan T. Radev

TL;DR

The paper tackles the reliability of amortized SBI when simulators are misspecified. It proposes an unsupervised misspecification measure based on Maximum Mean Discrepancy (MMD) in a learned summary space to detect distribution shifts between the model-implied and true data-generating processes, and it links these simulation gaps to posterior distortion. By augmenting neural posterior estimation with a structured summary objective and a finite-data MMD test, the authors demonstrate across five diverse experiments that the method detects simulation gaps and signals whether inferences can be trusted, guiding simulator refinement and potential post-hoc corrections. The approach can be implemented within BayesFlow and offers a practical tool for checking SBI results under model misspecification in scientific applications and high-stakes decision making.

Abstract

Recent advances in probabilistic deep learning enable efficient amortized Bayesian inference in settings where the likelihood function is only implicitly defined by a simulation program (simulation-based inference; SBI). But how faithful is such inference if the simulation represents reality somewhat inaccurately, that is, if the true system behavior at test time deviates from the one seen during training? We conceptualize the types of such model misspecification arising in SBI and systematically investigate how the performance of neural posterior approximators gradually deteriorates as a consequence, making inference results less and less trustworthy. To notify users about this problem, we propose a new misspecification measure that can be trained in an unsupervised fashion (i.e., without training data from the true distribution) and reliably detects model misspecification at test time. Our experiments clearly demonstrate the utility of our new measure both on toy examples with an analytical ground-truth and on representative scientific tasks in cell biology, cognitive decision making, disease outbreak dynamics, and computer vision. We show how the proposed misspecification test warns users about suspicious outputs, raises an alarm when predictions are not trustworthy, and guides model designers in their search for better simulators.
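
The detection idea described above (comparing summary statistics of observed data against the summary distribution seen during training) can be illustrated with a small MMD computation. The following is a minimal sketch, not the authors' implementation: it assumes a Gaussian kernel with a fixed bandwidth, uses the biased (V-statistic) MMD estimate, and replaces the outputs of a trained summary network with random draws. The names `gaussian_kernel` and `mmd2_biased`, the bandwidth, and the shift of 3.0 are all hypothetical choices for illustration; the paper's finite-data test and the BayesFlow code may differ in kernel, estimator, and thresholding.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


rng = np.random.default_rng(0)

# Stand-ins for summary statistics (assumption: S = 2 summary dimensions).
# "reference" plays the role of summaries of simulated training data,
# which the summary objective pushes towards a unit Gaussian.
reference = rng.standard_normal((500, 2))
inlier = rng.standard_normal((500, 2))          # well-specified case
shifted = rng.standard_normal((500, 2)) + 3.0   # hypothetical misspecified case

mmd_inlier = mmd2_biased(reference, inlier)
mmd_shifted = mmd2_biased(reference, shifted)

print(f"MMD^2, well-specified: {mmd_inlier:.4f}")
print(f"MMD^2, shifted:        {mmd_shifted:.4f}")
```

In this sketch the shifted sample yields a much larger squared MMD than the inlier sample, which is the kind of signal a misspecification test would threshold. How the threshold is calibrated for a finite number of observations is part of the paper's test and is not reproduced here.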

Paper Structure (30 sections, 22 equations, 17 figures, 5 tables, 1 algorithm)

Figures (17)

  • Figure 1: Conceptual overview of our neural approach. The summary network $h_\psi$ maps observations $\boldsymbol{x}$ to summary statistics $h_\psi(\boldsymbol{x})$, and the inference network $f_\phi$ estimates the posterior $p(\boldsymbol{\theta}\,|\,\boldsymbol{x},\mathcal{M})$ from the summary statistics. The generative model $\mathcal{M}$ creates training data $\boldsymbol{x}$ in the green region, and the networks learn to map these data to well-defined summary statistics and posteriors (green regions/dot/box). If the generative model $\mathcal{M}$ is misspecified, real observations $\accentset{\text{o}}{\boldsymbol{x}}$ fall outside the training region and are therefore mapped to outlying summary statistics and potentially incorrect posteriors (red dots/box). Since our learning approach enforces a known inlier summary distribution (e.g., Gaussian), misspecification can be detected by a distribution mismatch in summary space, as signaled by a high maximum mean discrepancy score (Gretton et al., 2012).
  • Figure 2: Preview of Experiment 3 on reaction time modeling in psychological experiments. Posteriors obtained via non-amortized MCMC (HMC in Stan) and amortized simulation-based inference (NPE in BayesFlow) are very similar when the model is well-specified (left). However, a simulation gap (here: not accounting for occasional slow responses due to mind wandering) leads to considerable disagreement between these methods (right).
  • Figure 3: Experiment 1. Summary space samples for the minimal sufficient summary network ($S=2$) from a well-specified model $\mathcal{M}$ (blue) and several misspecified configurations. Left: Prior misspecification can be detected. Right: Noise misspecification can be detected, while simulator scale misspecification is indistinguishable from the validation summary statistics.
  • Figure 4: Experiment 1. Summary space discrepancy (MMD to training distribution) and posterior error (RMSE of correct vs. analytic posterior means) as a function of misspecification severity. White stars indicate the well-specified model configuration (i.e., equal to the training model $\mathcal{M}$), where both MMD and posterior error are low.
  • Figure 5: Experiment 2. MMD increases with misspecification severity (mean, SD of 20 repetitions). Our test easily detects the setting from Ward et al. (2022).
  • ...and 12 more figures