Surrogate-Based Bayesian Inference: Uncertainty Quantification and Active Learning

Andrew Gerard Roberts, Michael C. Dietze, Jonathan H. Huggins

Abstract

Surrogate models, also called emulators, are widely used to facilitate Bayesian inference in settings where computational costs preclude the use of standard posterior inference algorithms. Their deployment is now standard practice across many scientific domains. However, integrating surrogates into statistical analyses introduces unique challenges that complicate established Bayesian workflow principles. While significant progress has been made in addressing these issues, the relevant developments are scattered across several distinct research communities with different emphases and perspectives. We present a unifying review that synthesizes the literature into a coherent framework, aiming to benefit both practitioners and methods developers. We place particular emphasis on propagating surrogate uncertainty and sequentially refining emulators via active learning, two key components of a robust surrogate-based Bayesian workflow.

Paper Structure (63 sections, 5 theorems, 72 equations, 5 figures, 1 table, 1 algorithm)

Key Result

Lemma A.1

Let $f_{N + B}^{\boldsymbol{\theta},\boldsymbol{\varphi}} \sim \mathcal{GP}(\mu_{N + B}^{\boldsymbol{\theta},\boldsymbol{\varphi}}, k_{N + B}^{\boldsymbol{\theta}})$ denote the GP $f_{N}$ conditioned on the additional training points $\{\boldsymbol{\theta},\boldsymbol{\va

Figures (5)

  • Figure 1: (Modular Surrogate-Based Bayesian Workflow) For simplicity, the diagram shows the case where noiseless target evaluations $\mathsf{f}(\theta)$ are directly available, but the workflow also applies when queries yield noisy or indirect information (see \ref{sec:reg-emulators}). The broad workflow stages are (1) Initial design: generate an initial set of training data by running the simulator and evaluating the function targeted for emulation; (2) Surrogate training: fit a probabilistic surrogate to the training set; (3) Uncertainty propagation: construct an uncertainty-aware posterior approximation using the emulator. The final component, active learning, is the iterative process of augmenting the training data and updating the emulator. The current approximate posterior is sometimes used to inform the selection of new query points (dashed arrow). The gray shaded boxes highlight the steps that require calls to the expensive simulator.
  • Figure 2: Graphical representation of a joint Bayesian model, with a "cut" (red dashed line) that prevents feedback from $y$ to $\mathsf{f}$. Shaded squares are observed variables (data $y$ and simulator queries $\mathcal{D}$), while circles are unobserved variables (parameters $\theta$ and target map $\mathsf{f}$). The joint Bayesian approach constructs a probability distribution over all quantities, then computes $p(\theta, \mathsf{f} \mid y, \mathcal{D})$. The cut model severs feedback so that $y$ does not affect inference for $\mathsf{f}$.
  • Figure 3: Pushforward distributions induced by GP forward model emulator (left column), GP log-posterior emulator (middle column), and a clipped (upper bounded) GP log-posterior emulator (right column). The plots summarize the pointwise marginal distributions of each quantity; in particular, the means (magenta lines) and 95% credible intervals (shaded regions). The black lines are ground truth (no emulation) and the gray dashed lines indicate the locations of the design inputs used to train the surrogates. The respective rows represent the surrogate-induced distributions over the (1) forward model, (2) unnormalized log-posterior density, (3) unnormalized posterior density, and (4) normalized posterior density. The top middle and top right entries are blank because the log-density surrogates do not produce an approximation of the forward model. The magenta lines in the third and fourth rows represent the unnormalized EUP and the EP, respectively (see \ref{sec:post-approx}).
  • Figure 4: A continuation of the example in Figure 3. From left to right, the columns correspond to the GP forward model emulator, GP log-posterior emulator, and clipped (upper bounded) GP log-posterior emulator. The top row summarizes the pointwise marginal distributions of the log unnormalized posterior approximation, showing the mean (magenta line) and 90% intervals against the true log-posterior (black). The bottom row presents different normalized posterior approximations relative to the true posterior (blue line). The vertical dashed lines indicate the locations of the design points.
  • Figure 5: A continuation of Figures 3 and 4. The shaded regions summarize the surrogate pushforward distributions as before. The lines correspond to different "maximum uncertainty" criteria targeting uncertainty in the respective distributions. For example, the blue lines in the left column (from top to bottom) show $-\mathrm{Var}(f_{N}(\theta))$, $-\mathrm{Var}(\log {\widetilde{\pi}}_{N}(\theta))$, $-\mathrm{Var}({\widetilde{\pi}}_{N}(\theta))$, and $-\mathrm{Var}({\pi}_{N}(\theta))$, where $f_{N}$ is a forward model emulator. The two other lines similarly show the (negated) pointwise entropy and interquartile range, and the columns vary the underlying emulator. The star markers indicate the optimal value for each acquisition function.
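
The workflow stages in the Figure 1 caption and the maximum-variance criterion $-\mathrm{Var}(f_{N}(\theta))$ from the Figure 5 caption can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the one-dimensional toy target `toy_target`, the zero-mean GP with a squared-exponential kernel and fixed hyperparameters, the candidate grid on $[0, 2]$, and the helper names `rbf_kernel`, `gp_posterior`, and `active_learning`. It covers only the noiseless forward-model case and does not construct a posterior approximation.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, jitter=1e-6):
    """Zero-mean GP predictive mean and variance at x_test (noiseless targets)."""
    K = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    Ks = rbf_kernel(x_test, x_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf_kernel(x_test, x_test).diagonal() - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

def toy_target(theta):
    """Hypothetical stand-in for the expensive target f(theta)."""
    return np.sin(3.0 * theta) + 0.5 * theta

def active_learning(n_init=3, n_steps=5):
    """Sequentially add the candidate with the largest predictive variance."""
    candidates = np.linspace(0.0, 2.0, 201)
    design = np.linspace(0.0, 2.0, n_init)   # (1) initial design
    values = toy_target(design)              # simulator calls
    max_vars = []
    for _ in range(n_steps):
        # (2) surrogate training: condition the GP on the current design
        _, var = gp_posterior(design, values, candidates)
        max_vars.append(var.max())
        # Active learning: minimizing -Var is the same as maximizing Var
        new_theta = candidates[np.argmax(var)]
        design = np.append(design, new_theta)
        values = np.append(values, toy_target(new_theta))
    mean, var = gp_posterior(design, values, candidates)
    return design, values, mean, var, max_vars

design, values, mean, var, max_vars = active_learning()
print("design points:", np.round(np.sort(design), 2))
print("max predictive variance, first vs. final:", max_vars[0], var.max())
```

Because the predictive variance of a noiseless GP with fixed hyperparameters depends only on the input locations, this particular criterion yields a space-filling design that ignores the observed values. The criteria in the lower rows of Figure 5, which target the (log) posterior density, would additionally depend on the emulator mean.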

Theorems & Definitions (35)

  • Remark 1
  • Example 1: Parameter Estimation for ODEs
  • Definition 1: Emulator Target
  • Definition 2: Simulator Observation Process
  • Example 2: Forward Model Target
  • Example 3: Log-Likelihood Target
  • Example 4: Noisy Forward Model Target
  • Example 5: ABC/SL Noisy Log-Likelihood Target
  • Example 6: Pseudo-Marginal Noisy Log-Likelihood Target
  • Example 7: Conditional Density Target
  • ...and 25 more