Understanding and mitigating difficulties in posterior predictive evaluation
Abhinav Agrawal, Justin Domke
TL;DR
This work addresses unreliable posterior predictive density (PPD) evaluation under approximate Bayesian inference, which arises because naive Monte Carlo estimators can have a tiny signal-to-noise ratio (SNR). It derives expressions for the SNR under both exact and approximate inference, showing exponential decay governed by data mismatch, latent dimensionality, and the test-to-training data ratio, with the decay quantity delta expressed via KL divergences or log-partition constants. To mitigate this, the authors introduce Learned Importance Sampling (LIS), which learns a test-time proposal by maximizing the IW-ELBO and then applies an importance-weighted estimator. LIS achieves substantially higher SNR and more accurate log PPD estimates across diverse models, including exponential families, linear and logistic regression, and a large-scale hierarchical MovieLens model. This makes PPD-based model comparison and evaluation more reliable when exact inference is impractical, and offers a practical, scalable option for Bayesian workflows with complex models.
Abstract
Posterior predictive densities (PPDs) are of interest in approximate Bayesian inference. Typically, these are estimated by simple Monte Carlo (MC) averages using samples from the approximate posterior. We observe that the signal-to-noise ratio (SNR) of such estimators can be extremely low. An analysis for exact inference reveals that the SNR decays exponentially with increases in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of the test data relative to the training data. Further analysis extends these results to approximate inference. To remedy the low SNR problem, we propose replacing simple MC sampling with importance sampling, using a proposal distribution optimized at test time on a variational proxy for the SNR, and demonstrate that this yields greatly improved estimates.
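The low-SNR effect and the benefit of importance sampling can be illustrated on a toy problem. The sketch below is not the paper's method or code. It assumes a one-dimensional conjugate Gaussian model (z ~ N(0, 1), x_i | z ~ N(z, 1)) chosen only because its posterior and exact PPD are available in closed form. The proposal is hand-built from the closed-form posterior given training and test data, with its variance inflated by 1.5, as a stand-in for a learned proposal. All function names, the `shift` parameter, and the sample sizes are hypothetical choices for this example.

```python
import numpy as np


def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x (broadcasts)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def gaussian_posterior(data, prior_var=1.0, noise_var=1.0):
    """Posterior mean and variance of z for z ~ N(0, prior_var), x_i | z ~ N(z, noise_var)."""
    precision = 1.0 / prior_var + len(data) / noise_var
    mean = (np.sum(data) / noise_var) / precision
    return mean, 1.0 / precision


def log_lik(z, data, noise_var=1.0):
    """log p(data | z) for each z; z has shape (K,), data has shape (m,)."""
    return log_normal(data[None, :], z[:, None], noise_var).sum(axis=1)


def empirical_snr(log_terms):
    """Sample mean / sample std of exp(log_terms); rescaling by the max does not change the ratio."""
    w = np.exp(log_terms - log_terms.max())
    return float(w.mean() / w.std())


def log_mean_exp(log_terms):
    """Numerically stable log of the average of exp(log_terms)."""
    m = log_terms.max()
    return float(m + np.log(np.mean(np.exp(log_terms - m))))


def demo(seed=0, n_train=20, n_test=20, shift=3.0, num_samples=10_000):
    rng = np.random.default_rng(seed)
    train = rng.normal(0.0, 1.0, n_train)
    test = rng.normal(shift, 1.0, n_test)  # `shift` creates train/test mismatch

    post_mean, post_var = gaussian_posterior(train)
    full_mean, full_var = gaussian_posterior(np.concatenate([train, test]))

    # Simple MC: average p(test | z) over samples from the training posterior.
    z = rng.normal(post_mean, np.sqrt(post_var), num_samples)
    naive_terms = log_lik(z, test)

    # Importance sampling: sample from a proposal r(z) and reweight by p(z | train) / r(z).
    prop_mean, prop_var = full_mean, 1.5 * full_var  # deliberately imperfect proposal
    z_is = rng.normal(prop_mean, np.sqrt(prop_var), num_samples)
    is_terms = (
        log_lik(z_is, test)
        + log_normal(z_is, post_mean, post_var)
        - log_normal(z_is, prop_mean, prop_var)
    )

    # Exact log PPD from Bayes' rule at a single point z0:
    # log p(test | train) = log p(test | z0) + log p(z0 | train) - log p(z0 | train, test).
    z0 = full_mean
    exact = (
        log_lik(np.array([z0]), test)[0]
        + log_normal(z0, post_mean, post_var)
        - log_normal(z0, full_mean, full_var)
    )

    return {
        "naive_snr": empirical_snr(naive_terms),
        "is_snr": empirical_snr(is_terms),
        "naive_log_ppd": log_mean_exp(naive_terms),
        "is_log_ppd": log_mean_exp(is_terms),
        "exact_log_ppd": float(exact),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting, increasing `shift` or `n_test` should make the simple MC terms more heavy-tailed, while the importance-sampling terms stay close to constant because the proposal is near the posterior given both datasets. Note that the empirical SNR of the simple MC estimator is itself a noisy, optimistic estimate when a handful of samples dominate the average.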
