Sampling-Based Accuracy Testing of Posterior Estimators for General Inference

Pablo Lemos; Adam Coogan; Yashar Hezaveh; Laurence Perreault-Levasseur

TL;DR

The paper tackles the challenge of validating posterior estimators produced by generative models in likelihood-based and simulation-based inference. It introduces Tests of Accuracy with Random Points (TARP), a coverage-based framework that needs only samples from the estimator, not posterior evaluations. TARP provides a necessary and sufficient condition for posterior accuracy by examining expected coverage across randomly positioned credible regions. The authors prove a central theorem linking correct expected coverage to exact posterior equality. On Gaussian toy models, they show that tests based on highest posterior density (HPD) regions can miss deficiencies, such as uninformative posteriors, that TARP reveals. On a high-dimensional gravitational lensing reconstruction, where HPD-based tests are intractable, TARP still separates an exact estimator from a biased one, underscoring its value for validating posterior inferences in modern SBI workflows.

Abstract

Parameter inference, i.e. inferring the posterior distribution of the parameters of a statistical model given some data, is a central problem in many scientific disciplines. Generative models can be used as an alternative to Markov Chain Monte Carlo methods for conducting posterior inference, both in likelihood-based and simulation-based problems. However, assessing the accuracy of posteriors encoded in generative models is not straightforward. In this paper, we introduce 'Tests of Accuracy with Random Points' (TARP) coverage testing as a method to estimate coverage probabilities of generative posterior estimators. Our method differs from previously existing coverage-based methods, which require posterior evaluations. We prove that our approach is necessary and sufficient to show that a posterior estimator is accurate. We demonstrate the method on a variety of synthetic examples, and show that TARP can be used to test the results of posterior inference analyses in high-dimensional spaces. We also show that our method can detect inaccurate inferences in cases where existing methods fail.

Paper Structure (20 sections, 5 theorems, 31 equations, 12 figures)

Introduction
Formalism
Notation
Coverage probability
Expected coverage probability
Our method
High posterior density coverage testing
Distance to random point coverage testing
Experiments
Gaussian Toy Model
Dependence on $\theta_r$ distribution and distance metric
Revealing when estimators are uninformative
Gravitational Lensing
Conclusions
Broader Impact
...and 5 more sections

Key Result

Theorem 1

The posterior has coverage probability $\operatorname{CP}(p, \alpha, x, \mathcal{G}) = 1 - \alpha$ for all values of $x$ and any credible region generator $\mathcal{G}(p, \alpha, x)$.

Figures (12)

Figure 1: A graphical illustration of the proposed coverage test for assessing the quality of a posterior estimator $\hat{p}$. Given a set of simulations (panels), we draw samples from the posterior estimator (orange points). We sample a reference parameter point $\theta_r$, and determine the fraction of points $f$ falling within a ball centered on $\theta_r$ extending to the true parameter point $\theta^*$ used to generate the simulation (ball indicated in yellow, $f$ indicated below each panel). Our coverage test aggregates the statistics of $f$, providing a necessary and sufficient way to guarantee the accuracy of $\hat{p}$.
Figure 2: Results on the Gaussian toy model for all four cases described in the Gaussian Toy Model section. The red line shows the method presented in this paper, while the blue line shows the HPD region.
Figure 3: An example of one of the lensing simulations performed. The top panels show the (latent) source plane that we are trying to infer, while the bottom panels show the distorted images. From left to right, the plot shows the truth, mean, and standard deviation of the samples from the posterior estimator (in the case of this figure, the 'exact' estimator), and the residuals. The noise in the observations is set to 1 on the color scales shown here.
Figure 4: Expected coverage vs credibility level for the uninformative posterior estimator described in the section "Revealing when estimators are uninformative". The blue line shows the coverage calculated using HPD regions, while the red lines use TARP regions. The continuous line uses reference points that are independent of $x$, while the dot-dashed line uses reference points that depend on $x$.
Figure 5: Expected coverage probability vs credibility level for our lensing example, for which tests based on HPD coverage are intractable. We see how, as expected, the exact posterior estimator (blue) accurately characterizes the posterior while the biased estimator (orange) does not.
...and 7 more figures
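
The procedure illustrated in Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tarp_coverage`, the Euclidean distance, the uniform box used for the reference points $\theta_r$, and the one-line Gaussian toy problem (unit-variance prior and noise, so the exact posterior is $\mathcal{N}(x/2, 1/2)$ per dimension) are all choices made here for the example. For each simulation it computes the fraction $f$ of estimator samples that lie closer to $\theta_r$ than the true point $\theta^*$ does, then reports the fraction of simulations with $f$ below each credibility level.

```python
import numpy as np


def tarp_coverage(samples, theta_true, theta_ref, n_levels=21):
    """Estimate expected coverage from samples only (hypothetical helper).

    samples:    (n_sims, n_samples, dim) draws from the posterior estimator
    theta_true: (n_sims, dim) parameters that generated each simulation
    theta_ref:  (n_sims, dim) reference points, drawn without using theta_true
    """
    # Distance from each reference point to its samples and to the true point.
    d_samples = np.linalg.norm(samples - theta_ref[:, None, :], axis=-1)
    d_true = np.linalg.norm(theta_true - theta_ref, axis=-1)
    # f: fraction of samples inside the ball centred on theta_ref
    # whose radius reaches theta_true.
    f = (d_samples < d_true[:, None]).mean(axis=1)
    # Expected coverage at each credibility level: empirical CDF of f.
    levels = np.linspace(0.0, 1.0, n_levels)
    ecp = (f[:, None] < levels[None, :]).mean(axis=0)
    return levels, ecp


# Toy problem (an assumption for this sketch): theta ~ N(0, 1), x = theta + N(0, 1).
rng = np.random.default_rng(0)
n_sims, n_samples, dim = 2000, 500, 2
theta_true = rng.normal(size=(n_sims, dim))
x = theta_true + rng.normal(size=(n_sims, dim))
post_mean, post_std = x / 2.0, np.sqrt(0.5)

noise = rng.normal(size=(n_sims, n_samples, dim))
exact = post_mean[:, None, :] + post_std * noise
overconfident = post_mean[:, None, :] + 0.5 * post_std * noise  # std halved

# Reference points: uniform in a fixed box, independent of the data.
theta_ref = rng.uniform(-4.0, 4.0, size=(n_sims, dim))

levels, ecp_exact = tarp_coverage(exact, theta_true, theta_ref)
_, ecp_over = tarp_coverage(overconfident, theta_true, theta_ref)

# Largest departure from the diagonal (ecp == level) for each estimator.
dev_exact = np.abs(ecp_exact - levels).max()
dev_over = np.abs(ecp_over - levels).max()
print(f"max deviation, exact estimator:         {dev_exact:.3f}")
print(f"max deviation, overconfident estimator: {dev_over:.3f}")
```

For an accurate estimator the expected coverage should track the credibility level, so the first deviation should be small, while the overconfident estimator should depart visibly from the diagonal. Only samples are needed; the estimator's density is never evaluated, which is the property that distinguishes this test from HPD-based coverage checks.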

Theorems & Definitions (14)

Definition 1
Definition 2
Definition 3
Definition 4
Theorem 1
Definition 5
Theorem 2
Theorem 3
Definition 6
Remark 1
...and 4 more
