Understanding and mitigating difficulties in posterior predictive evaluation

Abhinav Agrawal; Justin Domke

TL;DR

This work addresses unreliable posterior predictive density (PPD) evaluation under approximate Bayesian inference, which arises because naive Monte Carlo estimators can have a very small signal-to-noise ratio (SNR). It derives the SNR for both exact and approximate inference and shows that it decays exponentially with data mismatch, latent dimensionality, and the test-to-training data ratio, with the decay exponent $\delta$ quantified via KL divergences or log-partition constants. To mitigate this, the authors introduce Learned Importance Sampling (LIS), which learns a proposal at test time by maximizing the IW-ELBO and then applies an importance-weighted estimator, yielding substantially higher SNR and more accurate log PPD estimates across exponential-family models, linear and logistic regression, and a large-scale hierarchical MovieLens model. These improvements make PPD-based model comparison and evaluation more reliable when exact inference is impractical.
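To make the SNR claim concrete, here is a minimal sketch on a toy conjugate model chosen purely for illustration (it is not one of the paper's experiments): $z \sim N(0, 1)$ and $y \vert z \sim N(z, 1)$, so that the posterior and the log-normalization constant are available in closed form. The sketch assumes the naive estimator is $R_K = \frac{1}{K}\sum_k p(\mathcal{D}^* \vert z_k)$ with $z_k \sim p(z \vert \mathcal{D})$ and that SNR means mean divided by standard deviation. All function names are hypothetical.

```python
import math
import random

def log_marginal(ys):
    """V(D) = log of the integral of p(D|z) p(z) dz, for z ~ N(0,1), y|z ~ N(z,1)."""
    n, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
            - 0.5 * ss + 0.5 * s * s / (n + 1))

def exact_log_ppd(train, test):
    """log p(D*|D) = V(D + D*) - V(D), where + is list concatenation."""
    return log_marginal(train + test) - log_marginal(train)

def exact_snr(train, test, num_samples=1):
    """SNR of R_K from log(E[R_1^2] / E[R_1]^2), itself computed from V."""
    log_ratio = (log_marginal(train + test + test)
                 - 2.0 * log_marginal(train + test) + log_marginal(train))
    return math.sqrt(num_samples / math.expm1(log_ratio))

def log_lik(test, z):
    """log p(D*|z) for unit-variance Gaussian observations."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - z) ** 2 for y in test)

def log_mean_exp(xs):
    """Numerically stable log of the average of exp(x)."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs) / len(xs))

def naive_log_ppd(train, test, num_samples, rng):
    """log R_K with z_k drawn from the exact posterior p(z|D)."""
    n = len(train)
    post_mean, post_std = sum(train) / (n + 1), (n + 1) ** -0.5
    draws = (rng.gauss(post_mean, post_std) for _ in range(num_samples))
    return log_mean_exp([log_lik(test, z) for z in draws])

rng = random.Random(0)
train = [-1.0 + 2.0 * i / 99 for i in range(100)]            # training mean 0
for shift in (0.0, 1.0, 2.0):                                # data mismatch
    test = [shift - 1.0 + 2.0 * j / 49 for j in range(50)]   # test mean = shift
    err = exact_log_ppd(train, test) - naive_log_ppd(train, test, 2000, rng)
    print(f"shift={shift}: SNR(R_1)={exact_snr(train, test):.3g}, "
          f"log PPD - log R_K = {err:+.3f}")
```

In this toy setting the log of the second-moment ratio grows quadratically in the gap between the test mean and the posterior mean, so the SNR collapses quickly as the mismatch grows, while the error of the naive estimate stays small only when training and test data agree.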

Abstract

Predictive posterior densities (PPDs) are of interest in approximate Bayesian inference. Typically, these are estimated by simple Monte Carlo (MC) averages using samples from the approximate posterior. We observe that the signal-to-noise ratio (SNR) of such estimators can be extremely low. An analysis for exact inference reveals that the SNR decays exponentially with an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of the test data relative to the training data. Further analysis extends these results to approximate inference. To remedy the low SNR problem, we propose replacing simple MC sampling with importance sampling using a proposal distribution optimized at test time on a variational proxy for the SNR, and we demonstrate that this yields greatly improved estimates.
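The proposed remedy can be sketched on a toy conjugate model, $z \sim N(0, 1)$ and $y \vert z \sim N(z, 1)$, which is an assumption made here for illustration and not one of the paper's models. The sketch fits a Gaussian proposal $q$ at test time and then uses the importance-weighted estimate $\frac{1}{K}\sum_k p(\mathcal{D}^* \vert z_k)\, p(z_k \vert \mathcal{D}) / q(z_k)$ with $z_k \sim q$. As a simplification, the proposal is trained on the standard ELBO (the single-sample case of the IW-ELBO) with a hand-derived reparameterization gradient; it is not the authors' implementation, and all function names are hypothetical.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def normal_logpdf(x, mean, std):
    return -0.5 * LOG_2PI - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def log_mean_exp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs) / len(xs))

def posterior(train):
    """Exact p(z|D) = N(mean, std^2) for the toy model z ~ N(0,1), y|z ~ N(z,1)."""
    n = len(train)
    return sum(train) / (n + 1), (n + 1) ** -0.5

def exact_log_ppd(train, test):
    """Closed-form log p(D*|D), used only as a reference value."""
    def log_marginal(ys):
        n, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
        return -0.5 * (n * LOG_2PI + math.log(n + 1) + ss - s * s / (n + 1))
    return log_marginal(train + test) - log_marginal(train)

def log_target(z, test, post_mean, post_std):
    """log p(D*|z) + log p(z|D); its integral over z is the PPD."""
    return (sum(normal_logpdf(y, z, 1.0) for y in test)
            + normal_logpdf(z, post_mean, post_std))

def fit_proposal(train, test, rng, steps=1500, batch=32, lr=0.002):
    """Fit q = N(m, s^2) by reparameterized gradient ascent on the ELBO."""
    post_mean, post_std = posterior(train)
    test_sum, test_size = sum(test), len(test)
    m, log_s = post_mean, math.log(post_std)       # start at the posterior
    for _ in range(steps):
        s, grad_m, grad_log_s = math.exp(log_s), 0.0, 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)
            z = m + s * eps
            # d/dz of log_target, written out by hand for this toy model
            dz = test_sum - test_size * z - (z - post_mean) / post_std ** 2
            grad_m += dz / batch
            grad_log_s += dz * s * eps / batch
        m += lr * grad_m
        log_s += lr * (grad_log_s + 1.0)            # +1 from the entropy of q
    return m, math.exp(log_s)

def is_log_ppd(train, test, proposal, num_samples, rng):
    """Importance-weighted estimate: log of the average of target(z_k) / q(z_k)."""
    post_mean, post_std = posterior(train)
    q_mean, q_std = proposal
    log_w = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        log_w.append(log_target(z, test, post_mean, post_std)
                     - normal_logpdf(z, q_mean, q_std))
    return log_mean_exp(log_w)

rng = random.Random(0)
train = [-1.0 + 2.0 * i / 99 for i in range(100)]   # training mean 0
test = [1.0 + 2.0 * j / 49 for j in range(50)]      # test mean 2 (mismatch)
exact = exact_log_ppd(train, test)
naive = is_log_ppd(train, test, posterior(train), 1000, rng)   # q = p(z|D)
learned = is_log_ppd(train, test, fit_proposal(train, test, rng), 1000, rng)
print(f"exact={exact:.2f}  naive={naive:.2f}  learned proposal={learned:.2f}")
```

Passing the posterior itself as the proposal recovers the simple MC average, since the weights reduce to $p(\mathcal{D}^* \vert z_k)$. In this conjugate toy case the best Gaussian proposal is $p(z \vert \mathcal{D} + \mathcal{D}^*)$, for which the weights are constant; for non-conjugate models a fitted proposal can only approximate that.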

Paper Structure (27 sections, 23 theorems, 94 equations, 12 figures, 9 tables)

Introduction
Analysis with exact inference
Analysis with exact inference and conjugacy
Analysis with approximate inference
Learned Importance Sampling
Experiments
Exponential Family Models
Linear Regression
Logistic Regression
Hierarchical model
Discussion
Related Works
Proof for \ref{thm: snr monte carlo gen model form.}
Proof for \ref{prop: delta Bayesian CLT}
Note for the simplification from \ref{app: eq: KL for two gaussians substituted} to \ref{app: eq: KL for two gaussians substituted 2}
...and 12 more sections

Key Result

Theorem 1

Let $R_K$ be the Monte Carlo estimator for the $\textrm{PPD}$ (eq: naive mc estimator for ppdq.) with exact inference. Let $p(z, \mathcal{D}) = p(z)\prod_{y \in \mathcal{D}} p(y \vert z)$. Then $\textrm{SNR}\left(R_K\right) = \sqrt{K}/\sqrt{\exp(\delta)^2 - 1}$, where $\delta$ is expressed through the log-normalization function $V(\mathcal{D}) = \log \int p(\mathcal{D} \vert z) p(z) dz$.
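
One way to see how $V$ enters an expression of this shape is to compute the first two moments of the single-sample estimator. The sketch below assumes $R_K = \frac{1}{K}\sum_{k} p(\mathcal{D}^* \vert z_k)$ with $z_k \sim p(z \vert \mathcal{D})$ drawn i.i.d., takes the SNR to be the mean divided by the standard deviation, and writes $\mathcal{D} + 2\mathcal{D}^*$ for $\mathcal{D}$ with the test set appended twice. These conventions are assumptions, since the referenced equations are not reproduced here.

```latex
\begin{align*}
\mathbb{E}[R_1]
  &= \int p(\mathcal{D}^* \vert z)\, p(z \vert \mathcal{D})\, dz
   = \exp\big(V(\mathcal{D} + \mathcal{D}^*) - V(\mathcal{D})\big), \\
\mathbb{E}[R_1^2]
  &= \int p(\mathcal{D}^* \vert z)^2\, p(z \vert \mathcal{D})\, dz
   = \exp\big(V(\mathcal{D} + 2\mathcal{D}^*) - V(\mathcal{D})\big), \\
\frac{\mathbb{E}[R_1^2]}{\mathbb{E}[R_1]^2}
  &= \exp\big(V(\mathcal{D} + 2\mathcal{D}^*) - 2V(\mathcal{D} + \mathcal{D}^*) + V(\mathcal{D})\big), \\
\mathrm{SNR}(R_K)
  &= \frac{\mathbb{E}[R_K]}{\sqrt{\mathbb{V}[R_K]}}
   = \frac{\sqrt{K}}{\sqrt{\mathbb{E}[R_1^2] / \mathbb{E}[R_1]^2 - 1}}.
\end{align*}
```

Under these assumptions, the exponential term under the square root in the theorem corresponds to the second-moment ratio in the third line, and the $\sqrt{K}$ factor comes from averaging $K$ independent samples.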

Figures (12)

Figure 1: Left. SNR contours of the naive MC estimator for a linear regression model when sampling from the true posterior. Right. The evaluation error, given by $\log \textrm{PPD} - \log R_K$, for the linear regression model when either the data mismatch, the dimensionality of $z$, or the size of $\mathcal{D}^*$ relative to $\mathcal{D}$ is high. The error is extremely large and sometimes does not improve much with more samples. What explains this? How can we do better evaluation?
Figure 2: SNR rapidly decays with $\delta$.
Figure 3: Left. The log partition function $B(\xi)$ (\ref{eq: normalization constant for conjugate prior}). Right. The values of $B(\xi)$ along the lines joining $\xi_{\mathcal{D}}$ (purple) to $\xi_{\mathcal{D} + \mathcal{D}_{1}^{*}}$ (red) and $\xi_{\mathcal{D} + \mathcal{D}_{2}^{*}}$ (blue).
Figure 4: Evaluating $\textrm{PPD}$ with Learned IS.
Figure 5: $\textrm{SNR}\left(R_1\right)$ contours. $\overline{T}(\mathcal{D})$ denotes the average sufficient statistics of the data points in $\mathcal{D}$. Here $\left\vert \mathcal{D}\right\vert = 100$, and the red dotted line indicates $\overline{T}(\mathcal{D}) = 10$. Data mismatch increases as we move away from the red dotted line, and the relative size of $\mathcal{D}^*$ increases as we move along the horizontal axis. Either way, SNR decreases exponentially. SNR is calculated in closed form after deriving $B$ and plugging it into $\delta$ in \ref{eq: exp family delta form 2.}.
...and 7 more figures

Theorems & Definitions (45)

Theorem 1
Proposition 2: Informal
Theorem 4
Theorem 6
Lemma 7
proof
Definition 8: Log-normalization function
Lemma 9
proof
Lemma 10
...and 35 more
