Predictive variational inference: Learn the predictively optimal posterior distribution

Jinlin Lai, Yuling Yao

TL;DR

Predictive Variational Inference (PVI) reframes posterior inference as optimizing the posterior predictive distribution to be close to the data-generating process under a chosen scoring rule, rather than approximating the exact Bayesian posterior. By using flexible variational families (e.g., normalizing flows) and regularization that can interpolate toward Bayesian posteriors, PVI yields predictive-optimal posteriors that may differ from traditional Bayes, especially under model misspecification, and can reveal population-level heterogeneity through non-vanishing posterior uncertainty. The framework supports both likelihood-exact and likelihood-free settings and provides gradient estimators for multiple scoring rules (log, quadratic, CRPS), enabling practical SGD-based optimization. Empirically, PVI improves held-out predictive performance on real data tasks (election analysis) and likelihood-free cryoEM, while also acting as a diagnostic for model expansion by exposing parameter heterogeneity. Overall, PVI offers a robust, diagnostic, and scalable approach to predictive inference that directly targets predictive accuracy and uncertainty calibration in the presence of misspecification.
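
Of the scoring rules named above, CRPS can be estimated from predictive draws alone, which is why it remains usable when the likelihood cannot be evaluated. The following is a minimal sketch, assuming the standard identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ with $X, X' \sim F$ independent. The function name `crps_from_draws` and the plain Monte Carlo estimator are illustrative assumptions, not the paper's gradient estimator.

```python
import numpy as np

def crps_from_draws(draws, y):
    """Monte Carlo CRPS estimate (lower is better) from predictive draws.

    Hypothetical helper for illustration: `draws` are samples from the
    posterior predictive distribution, `y` is one observed outcome.
    """
    draws = np.asarray(draws, dtype=float)
    # E|X - y|: how far the predictive draws sit from the observation.
    fit = np.mean(np.abs(draws - y))
    # 0.5 * E|X - X'|: spread of the predictive, averaged over all pairs.
    spread = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return fit - spread
```

A point-mass predictive at 0 scored against `y = 2.0` gives the absolute error, since the spread term vanishes. To maximize a score, as in the framing above, one would use the negative of this quantity.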

Abstract

Vanilla variational inference finds an optimal approximation to the Bayesian posterior distribution, but even the exact Bayesian posterior is often not meaningful under model misspecification. We propose predictive variational inference (PVI): a general inference framework that seeks and samples from an optimal posterior density such that the resulting posterior predictive distribution is as close to the true data-generating process as possible, where closeness is measured by one of several scoring rules. The posterior that optimizes this objective is generally not the same as the Bayesian posterior, and does not attempt to approximate it, even asymptotically. Rather, we interpret it as an implicit hierarchical expansion. Further, the learned posterior uncertainty detects heterogeneity of parameters among the population, enabling automatic model diagnosis. This framework applies to both likelihood-exact and likelihood-free models. We demonstrate its application in real data examples.
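
To make the contrast with the Bayesian posterior concrete, here is a minimal sketch on a toy normal model. Everything specific in it is an assumption for illustration and may differ from the paper's setup: the likelihood is $y \mid \theta \sim \mathcal{N}(\theta, \sigma_{\text{lik}}^2)$ with known $\sigma_{\text{lik}} = 1$, the variational family is $q(\theta) = \mathcal{N}(\mu, \tau^2)$, the score is the log score, and the Bayesian comparison uses a flat prior. Under these assumptions the posterior predictive is $\mathcal{N}(\mu, \sigma_{\text{lik}}^2 + \tau^2)$, so maximizing the average log predictive density has a closed form.

```python
import numpy as np

def pvi_normal_log_score(y, sigma_lik=1.0):
    """Closed-form log-score optimum for the toy model (illustrative helper).

    The predictive N(mu, sigma_lik^2 + tau^2) is fit by maximum likelihood,
    subject to the constraint tau >= 0.
    """
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    pred_var = y.var()  # unconstrained MLE of the predictive variance
    # tau^2 absorbs whatever variance the likelihood cannot explain.
    tau = np.sqrt(max(pred_var - sigma_lik**2, 0.0))
    return mu, tau

def flat_prior_bayes_sd(n, sigma_lik=1.0):
    """Exact posterior sd of theta under a flat prior (assumed for comparison)."""
    return sigma_lik / np.sqrt(n)

rng = np.random.default_rng(0)
for sigma_true in (1.0, 2.0):  # well-specified vs. misspecified data
    y = rng.normal(0.0, sigma_true, size=100_000)
    mu, tau = pvi_normal_log_score(y)
    print(f"sigma_true={sigma_true}: PVI sd={tau:.3f}, "
          f"Bayes sd={flat_prior_bayes_sd(len(y)):.4f}")
```

In this sketch the Bayesian posterior standard deviation shrinks at rate $1/\sqrt{n}$ regardless of the data, whereas the predictively optimal $\tau$ tends to $\sqrt{\sigma_{\text{true}}^2 - \sigma_{\text{lik}}^2}$ when the data are more dispersed than the likelihood allows, and to zero when the model is well specified.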

Paper Structure

This paper contains 42 sections, 7 theorems, 38 equations, 8 figures, 4 tables.

Key Result

Proposition 1

Let the variational distribution $q_\phi(\theta)$ be parameterized by $\phi\in\Phi$, where $\Phi$ is compact. With any strictly proper score function $S$, an unknown true data generating process distribution $p_{\mathrm{true}}(y)$, a likelihood model $p(y|\theta)$, and a size-$n$ sample $y_1, y_2, \ldots$ If $\mathbb{E}_{y\sim p_{\mathrm{true}}}\bigl[\sup_{\phi\in\Phi}\bigl|S\cdots$ …

Figures (8)

  • Figure 1: When inferring the molecule angle in a heterogeneous population from cryoEM images, exact Bayes yields an overconfident posterior as the sample size $n$ grows, while PVI produces a variational distribution close to the ground truth of the population. For details, see the experiments section.
  • Figure 2: Optimal standard deviation of $q(\theta)$ under PVI or VI for the simple normal case, from 50 simulations. Convergence is shown for both the well-specified case ($\sigma_{\text{true}}=1$) and the misspecified case ($\sigma_{\text{true}}=2$).
  • Figure 3: Distribution of inferred logit across states with $x=1$ for ethnic group $1$ from four different voting models. To focus on the variability across states, we set $\beta_{2,1}$ to its sample mean. VI always concentrates, while PVI only concentrates when the model is correct.
  • Figure 4: Regression logits of four states learned from the Current Population Survey’s (CPS) post-election voting and registration supplement in 2000. Unrelated parameters are replaced by their sample means. Two models are compared: one with a constant coefficient, and one with coefficients that vary among states. Under the first model, PVI assigns different variances to different states. A higher variance implies a larger change in slope once the coefficients are allowed to vary. Such behavior cannot be detected with VI.
  • Figure 5: Parameter inference for cryoEM with PVI. In the first panel, the y-axis reports the gap $K$ between the inferred distribution and the true distribution for varying SNR and sample size, where $K$ is computed from a two-sample KS test. The second panel shows that the PVI-inferred posterior distribution accurately recovers the true distribution in a realistic setting with $N=10,000$ and $\text{SNR}\approx 0.13$.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Proposition 1
  • Corollary 1
  • Corollary 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • ...and 1 more