Tests for model misspecification in simulation-based inference: from local distortions to global model checks

Noemi Anau Montel, James Alvey, Christoph Weniger

TL;DR

This work tackles misspecification in simulation-based inference (SBI) by introducing distortion-driven tests that treat a base simulator as the null hypothesis $H_0$ and a broad ensemble of augmented simulators as alternatives $H_i$. Central to the approach are localized test statistics $t_i(x) = -2 \ln \frac{p(x|H_0)}{p(x|H_i)}$ and their aggregate $t_\mathrm{sum}(x) = \sum_i t_i(x)$, with global p-values computed via Monte Carlo sampling to account for multiple correlated tests. The authors present two training strategies, BCE (classifier-based) and SNR (matched-filter-based), to efficiently learn these statistics, and demonstrate connections to classical frameworks (matched filtering, $\chi^2$ goodness-of-fit). They validate the framework on a toy example and apply it to GW150914, finding no significant misspecification while providing a rich diagnostic tool for end-to-end SBI analyses. An adaptive, self-calibrating distortions algorithm further enhances practical applicability by tuning distortion amplitudes to remain plausible given observational noise. The approach offers a flexible, principled path toward robust SBI pipelines in physics and astrophysics, enabling thorough discrepancy detection and interpretation beyond parameter estimation.
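To make the local, aggregate, and global quantities concrete, the following is a minimal sketch under strong simplifying assumptions that are not taken from the paper: the base model $H_0$ is unit-variance Gaussian noise in each bin, each $H_i$ shifts bin $i$ by a known amplitude `eps`, and the global statistic is taken to be the minimum local p-value over all tests. Under these assumptions the log-likelihood ratio is available in closed form, $t_i(x) = 2\,\mathrm{eps}\,x_i - \mathrm{eps}^2$, so no network needs to be trained. The function names (`local_stats`, `mc_pvalues`, `global_pvalue`) are hypothetical.

```python
import numpy as np

def local_stats(x, eps):
    # Closed-form t_i(x) = -2 ln[p(x|H0)/p(x|Hi)] for the assumed toy model:
    # H0: x_i ~ N(0, 1);  Hi: bin i shifted to N(eps, 1), other bins unchanged.
    return 2.0 * eps * x - eps**2

def all_stats(x, eps):
    # Stack the local statistics t_i and the aggregate t_sum = sum_i t_i.
    t = local_stats(x, eps)
    return np.concatenate([t, t.sum(axis=-1, keepdims=True)], axis=-1)

def mc_pvalues(t, t_null):
    # Monte Carlo p-value per test: fraction of H0 draws with a statistic
    # at least as large as the given one (with the usual +1 correction).
    t = np.atleast_2d(t)                                   # (m, n_tests)
    counts = (t_null[None, :, :] >= t[:, None, :]).sum(axis=1)
    return (1.0 + counts) / (1.0 + t_null.shape[0])

def global_pvalue(x_obs, eps=1.0, n_sim=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_bins = x_obs.shape[0]
    # Two independent H0 ensembles: one defines the null distributions,
    # the other calibrates the minimum-p-value statistic.
    t_null = all_stats(rng.standard_normal((n_sim, n_bins)), eps)
    t_cal = all_stats(rng.standard_normal((n_sim, n_bins)), eps)

    p_obs = mc_pvalues(all_stats(x_obs, eps), t_null)[0]   # local + aggregate
    min_obs = p_obs.min()
    min_cal = mc_pvalues(t_cal, t_null).min(axis=1)
    # Global p-value: how often pure H0 data yields a minimum local p-value
    # at least as extreme as the observed one. Correlations between tests
    # are handled automatically because all tests share each simulation.
    return (1.0 + np.sum(min_cal <= min_obs)) / (1.0 + n_sim)

# Example: a single strongly distorted bin versus featureless data.
x_spike = np.zeros(20)
x_spike[3] = 6.0
p_spike = global_pvalue(x_spike)
p_flat = global_pvalue(np.zeros(20))
```

In this toy setting `p_spike` comes out small while `p_flat` stays close to one. In the framework described above, the closed-form `local_stats` would be replaced by statistics learned from simulations.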

Abstract

Model misspecification analysis strategies, such as anomaly detection, model validation, and model comparison, are a key component of scientific model development. Over the last few years, there has been a rapid rise in the use of simulation-based inference (SBI) techniques for Bayesian parameter estimation, applied to increasingly complex forward models. To move towards fully simulation-based analysis pipelines, however, there is an urgent need for a comprehensive simulation-based framework for model misspecification analysis. In this work, we provide a solid and flexible foundation for a wide range of model discrepancy analysis tasks, using distortion-driven model misspecification tests. From a theoretical perspective, we introduce a statistical framework built around performing many hypothesis tests for distortions of the simulation model. We also make explicit analytic connections to classical techniques: anomaly detection, model validation, and goodness-of-fit residual analysis. Furthermore, we introduce an efficient self-calibrating training algorithm that is useful for practitioners. We demonstrate the performance of the framework in multiple scenarios, making the connection to classical results where they are valid. Finally, we show how to conduct such a distortion-driven model misspecification test for real gravitational wave data, specifically on the event GW150914.

Paper Structure

This paper contains 20 sections, 39 equations, 7 figures.

Figures (7)

  • Figure 1: Summary illustration of the presented framework for tests of model misspecification in SBI (see Section \ref{sec:method} for details). Left panel: An ensemble of localized test statistics is learned by neural networks (see Appendix \ref{app:BCE} for details); they are typically more sensitive towards isolated distortions and in some limits can be the basis for anomaly detection. Their individual significance can be quantified with Monte Carlo estimates, and in specific training scenarios (see Appendix \ref{app:SNR} for details) one can visualize the distortions to the model in data space through residuals. Right panel: Aggregated test statistics can be constructed given any subset of localized test statistics; they are sensitive towards the cumulative evidence of multiple distortions and in some limits can be the basis for model validation tests. Their individual significance can be quantified with Monte Carlo estimates, and in specific training scenarios one can perform a residual variance analysis. Central panel: We can estimate the overall global significance of all the performed tests, accounting for their correlation.
  • Figure 2: Comprehensive summary of the framework results for the instructive example presented in Section \ref{sec:example}. The panels follow the structure of our framework, as represented in the summary graphic Figure \ref{fig:summary}. The upper-left panel depicts scattered data points $\bm{x}_\mathrm{obs}$, the baseline signal $\bm{\mu}_\mathrm{sim}$, and the signal distorted by an additive stochastic distortion of type B, $\bm{\mu}_\mathrm{distB}$, as described in Section \ref{sec:example}. The gray bands highlight the 1-, 2-, and 3-$\sigma$ regions of the baseline Gaussian noise. The upper-right panel visualizes the three types of deviations under investigation, which differ in correlation scale. In the following panels, the results from different networks are color-coded by the type of distortion they were trained on. The center-left panel shows the significance of localized distortions, the center-right panel the significance of the aggregated distortions, and the bottom text the global significance. Finally, the bottom panels show the strength of the distortions in data space and their variance. These latter results are achievable only through the SNR training strategy (see Section \ref{subsec:training}).
  • Figure 3: Same as Figure \ref{fig:plot1}, but applied to data distorted by many small distortions.
  • Figure 4: Illustration of the adaptive training of distortion amplitudes in our framework. The figure shows how the generated distortions (color-coded to match the legend in Figure \ref{fig:plot1}) dynamically adjust to envelop the deterministic part of the baseline model (black dashed lines) during training. The baseline model noise is omitted from the plot for clarity. By adaptively tuning the distortion amplitude parameter $b$ based on the learned variance $\sigma^2$ and a desired maximum signal-to-noise ratio $\mathrm{SNR}_\mathrm{max}$, the distortions remain plausible.
  • Figure 5: Top panel: GW150914 data for the Hanford detector and posterior-predictive distribution samples from the Bayesian inference step as described in Section \ref{subsec:gw_inference} and processed as described in Section \ref{subsec:gw_process}. Bottom panel: Results of our framework for GW150914 from the Hanford detector. We test for an independent bin-wise distortion and a correlated one, using both our training strategies, dubbed BCE and SNR respectively (Section \ref{subsec:training}). As expected, no significant anomaly is present in the modelling of GW150914, with global $\mathrm{p}$-values of a few tenths for all types of analysis.
  • ...and 2 more figures