Unifying Summary Statistic Selection for Approximate Bayesian Computation

Till Hoffmann; Jukka-Pekka Onnela

TL;DR

This work shows that minimizing the expected posterior entropy (EPE) under the prior predictive provides a unifying framework for learning informative summary statistics in approximate Bayesian computation. It demonstrates that many established information-theoretic approaches are equivalent to, or special cases of, EPE minimization, and it proposes practical conditional density-estimation methods (e.g., mixture density networks) to automatically learn high-fidelity summaries. Through three diverse problems—a multimodal benchmark, a population-genetics model, and a dynamic network model of growing trees—the authors show that EPE-based summaries can yield posterior inferences competitive with dedicated likelihood-based approaches, offering a powerful general tool for likelihood-free inference. The study also clarifies key distinctions among sufficient, lossless, and optimal summaries and provides guidance on when and how to apply EPE-based compression in practice.

Abstract

Extracting low-dimensional summary statistics from large datasets is essential for efficient (likelihood-free) inference. We characterize three different classes of summaries and demonstrate their importance for correctly analyzing dimensionality reduction algorithms. We demonstrate that minimizing the expected posterior entropy (EPE) under the prior predictive distribution of the model provides a unifying principle that subsumes many existing methods; they are shown to be equivalent to, or special or limiting cases of, minimizing the EPE. We offer a unifying framework for obtaining informative summaries and propose a practical method using conditional density estimation to learn high-fidelity summaries automatically. We evaluate this approach on diverse problems, including a challenging benchmark model with a multi-modal posterior, a population genetics model, and a dynamic network model of growing trees. The results show that EPE-minimizing summaries can lead to posterior inference that is competitive with, and in some cases superior to, dedicated likelihood-based approaches, providing a powerful and general tool for practitioners.
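To make the EPE objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy conjugate model that does not appear in the source: a standard normal prior on $\theta$ and $n$ unit-variance normal observations, so the posterior given a summary is available in closed form. The function name `estimate_epe` and the two candidate summaries are hypothetical. The EPE is estimated by Monte Carlo as the average negative log posterior density over draws from the prior predictive.

```python
import numpy as np


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def estimate_epe(summary, n_obs=10, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the expected posterior entropy (EPE).

    Toy model (an assumption for illustration only):
        theta ~ Normal(0, 1),  z_i | theta ~ Normal(theta, 1),  i = 1..n_obs.
    """
    rng = np.random.default_rng(seed)
    # Draw parameters from the prior and data from the simulator.
    theta = rng.normal(0.0, 1.0, size=n_draws)
    z = rng.normal(theta[:, None], 1.0, size=(n_draws, n_obs))

    if summary == "mean":
        # Sample mean of all observations (sufficient for this toy model).
        k, t = n_obs, z.mean(axis=1)
    elif summary == "first":
        # Only the first observation (discards information).
        k, t = 1, z[:, 0]
    else:
        raise ValueError(f"unknown summary: {summary}")

    # Conjugate posterior given the mean of k unit-variance observations.
    post_var = 1.0 / (1.0 + k)
    post_mean = k * t / (1.0 + k)
    # EPE = -E_{theta, z}[log f(theta | t(z))].
    return -gaussian_log_pdf(theta, post_mean, post_var).mean()


epe_mean = estimate_epe("mean")
epe_first = estimate_epe("first")
# Analytic value for this toy model: 0.5 * log(2 * pi * e / (1 + k)).
print(f"EPE with sample mean:       {epe_mean:.3f}")
print(f"EPE with first observation: {epe_first:.3f}")
```

In this toy setting the sample mean attains a lower EPE than the single observation, which is the sense in which EPE ranks candidate summaries by how much posterior uncertainty they leave.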

Paper Structure (22 sections, 1 theorem, 39 equations, 6 figures, 1 table)

Introduction
Background
Minimizing the expected posterior entropy
Related work and connections with expected posterior entropy
Approximate sufficiency
Minimizing the conditional posterior entropy
Maximizing the Fisher information
Minimizing the Bayes risk
Maximizing the mutual information
Model selection
Conditional posterior density estimation
Partial least squares regression
Experiments
Evaluation criteria and model architecture for nonlinear methods
Benchmark model
...and 7 more sections

Key Result

Proposition 1

Let $\theta\sim\pi\left(\theta\right)$, $z\sim g\left(z\mid\theta\right)$, and $t\in\mathcal{T}$ be a deterministic function. Then $t=t(z)$ is a sufficient statistic for $g\left(z\mid\theta\right)$ if and only if $t=\mathop{\mathrm{argmax}}_{t'\in\mathcal{T}} I\left\{\theta,t'(z)\right\}$, where $I\left\{\cdot,\cdot\right\}$ denotes the mutual information between two random variables.
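
The mutual information in Proposition 1 is tied to the expected posterior entropy by a standard identity. The short derivation below is a sketch; the entropy notation $H\{\cdot\}$ and the symbol $\mathrm{EPE}(t)$ are assumptions for illustration and may differ from the paper's notation.

```latex
\begin{aligned}
I\left\{\theta, t(z)\right\}
  &= H\left\{\theta\right\}
     - \mathbb{E}_{z}\left[ H\left\{\theta \mid t(z)\right\} \right], \\
\mathrm{EPE}(t)
  &= \mathbb{E}_{z}\left[ H\left\{\theta \mid t(z)\right\} \right]
   = -\,\mathbb{E}_{\theta, z}\left[ \log f\left(\theta \mid t(z)\right) \right], \\
\mathop{\mathrm{argmax}}_{t \in \mathcal{T}} I\left\{\theta, t(z)\right\}
  &= \mathop{\mathrm{argmin}}_{t \in \mathcal{T}} \mathrm{EPE}(t).
\end{aligned}
```

The last line holds because the prior entropy $H\left\{\theta\right\}$ does not depend on the choice of $t$, so maximizing the mutual information and minimizing the EPE select the same summaries.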

Figures (6)

Figure 1: Different methods for compressing data to informative summaries are intimately related; distinguishing between classes of summaries is essential. Panel (a) illustrates that five information-theoretic approaches (ITAs) are equivalent: they implicitly minimize the same loss (see the sections "Background" and "Minimizing the expected posterior entropy"). Approximate sufficiency (section "Approximate sufficiency") seeks to achieve lossless compression, and minimizing the posterior entropy (section "Minimizing the conditional posterior entropy") is a special case of ITAs focused on only the observed data. Maximizing Fisher information (section "Maximizing the Fisher information") and minimizing $L^2$ Bayes risk (section "Minimizing the Bayes risk") are equivalent to each other and to ITAs in the large-sample limit. Probabilistic model selection (section "Model selection") maps onto ITAs if we treat model labels as parameters. A dashed arrow from one method to another indicates that the latter is a specialization of the former. Solid arrows indicate correspondence in the large-sample limit. Panel (b) illustrates relationships between classes of summaries. Sufficient statistics $\mathcal{S}$ are a subset of lossless statistics $\mathcal{L}$, although the former only exist if the likelihood belongs to the exponential family. The intersection of lossless summaries $\mathcal{L}$ and the summaries $\mathcal{T}$ considered by the practitioner are optimal summaries $\mathcal{O}$. Optimal summaries are not necessarily lossless, e.g. if $\mathcal{T}$ is restricted to parametric transformations.
Figure 2: Extracting summaries can be non-trivial even for toy models. Panel (a) shows the difference between posterior and prior entropy for a model with zero-mean normal likelihood and conjugate gamma prior for the precision $\theta$ (inverse variance). For a subset of the prior and data space, minimizing the posterior entropy discards the second moment $t$, a sufficient statistic. Panel (b) shows the bimodal posterior for the example point in (a) that arises when the precision of the likelihood is $\mathop{\mathrm{abs}}\nolimits\left(\theta\right)$ (see the section "Minimizing the Bayes risk"). The posterior mean is zero and not informative of the parameter. The vertical dashed line represents the maximum likelihood estimate $\widehat{\mathop{\mathrm{abs}}\nolimits\left(\theta\right)}$ of the precision $\mathop{\mathrm{abs}}\nolimits\left(\theta\right)$.
Figure 3: Optimal summaries depend on the prior. Panel (a) shows the parameters of a piecewise likelihood with qualitatively different behaviour on either side of the transition at $\theta=0$. Panel (c) shows two priors with the bulk of their mass on either side of the transition. Panels (b) and (d) show the relationship between the parameter and the sample mean $\bar{y}$ and log variance $\log\mathop{\mathrm{var}}\nolimits y$, respectively, as a scatter plot. Mutual information estimates highlight that the optimal choice of summary depends on the prior: The $\bar{y}$ and $\log\mathop{\mathrm{var}}\nolimits y$ summaries are informative for the priors centred at $+1$ and $-1$, respectively.
Figure 4: Mixture density networks with a bottleneck can learn informative summaries. The stack left of the compressor $t$ illustrates the training data generation and MDN training procedure: $p$-dimensional parameters $\theta$ and synthetic data $z$ are drawn from the prior $\pi$ and simulator $g$, respectively. Synthetic data are compressed to summaries using a compressor $t$. The stack right of the compressor $t$ illustrates approximate Bayesian computation using learned summaries: The compressor evaluates summaries of observed data $y$, and parameter samples are accepted if corresponding simulated summaries $t\left(z\right)$ are sufficiently close to observed summaries $t\left(y\right)$. The red dashed box indicates components specific to training MDN compression: A mixture density network (MDN) $h$ estimates a posterior approximation $\hat{f}\left(\theta\mid t(z)\right)$ given learned summaries $t(z)$. Here, $\mathcal{F}$ are the supported posteriors, e.g. MDNs with certain component distributions. The network is trained by minimizing the negative log probability (NLP) loss. The table lists the type of data $\mathbb{D}$ and compressor architecture for each experiment (see the sections on the benchmark, population-genetics, and growing-tree experiments for details).
Figure 5: A conditional mixture density network (MDN) that minimizes the expected posterior entropy learns highly informative summaries. Panel (a) shows the likelihood for the true parameter $\theta^*\approx 1.6$ that generated the example dataset $y$ together with a rug plot for the $n=10$ observations $y_{\bullet 1}$. Panel (b) shows the true posterior $f\left(\theta\mid y\right)$ together with the learned posterior density estimator. While the two-component mixture is not flexible enough to approximate the true posterior well, it learns highly informative summaries: MDN-compressed ABC samples using these summaries are shown as a histogram. Panel (c) shows the learned summary function $t: \mathbb{R}^{10 \times 2} \to \mathbb{R}$ which maps the full data matrix to a scalar; the plot shows $t(y)$ as a function of the first column values $y_{\bullet1}$ (the informative data, with the second column being uninformative noise). The dashed line shows how $t$ can be approximated using polynomial basis functions of the candidate summaries (the first three even moments). Panel (d) illustrates the relationship between the posterior density estimator and the summary as a heat map; lighter colours indicate higher posterior density.
...and 1 more figures
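
The approximate Bayesian computation step described in the Figure 4 caption (accept parameter draws whose simulated summaries $t(z)$ are close to the observed summaries $t(y)$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function `rejection_abc`, the toy normal model, the hand-picked sample-mean compressor (standing in for a learned compressor), and the rule of keeping the `n_keep` nearest draws are all hypothetical choices made for the example.

```python
import numpy as np


def rejection_abc(y, prior_sampler, simulator, compressor,
                  n_draws=20_000, n_keep=200, seed=0):
    """Rejection ABC: keep the n_keep prior draws whose simulated summaries
    are closest (Euclidean distance) to the observed summaries."""
    rng = np.random.default_rng(seed)
    theta = prior_sampler(rng, n_draws)
    # Summaries of the observed data, t(y).
    t_obs = np.atleast_1d(compressor(y))
    # Summaries of one simulated dataset per prior draw, t(z).
    t_sim = np.stack([
        np.atleast_1d(compressor(simulator(rng, th))) for th in theta
    ])
    dist = np.linalg.norm(t_sim - t_obs, axis=1)
    keep = np.argsort(dist)[:n_keep]
    return theta[keep]


# Toy model (an assumption for illustration only):
# theta ~ Normal(0, 1) and ten unit-variance observations centred on theta.
def prior_sampler(rng, size):
    return rng.normal(0.0, 1.0, size=size)


def simulator(rng, theta):
    return rng.normal(theta, 1.0, size=10)


def compressor(z):
    # Hand-picked summary standing in for a learned compressor t.
    return z.mean()


y = simulator(np.random.default_rng(42), 1.0)  # "observed" data, theta* = 1
samples = rejection_abc(y, prior_sampler, simulator, compressor)
# For this conjugate toy model the exact posterior mean is 10 * mean(y) / 11.
print(samples.mean(), 10 * y.mean() / 11)
```

Keeping a fixed number of nearest draws is one common way to set the acceptance threshold; a fixed distance tolerance would serve equally well in this sketch.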

Theorems & Definitions (1)

Proposition 1
