Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation

Bohan Wu, David Blei

TL;DR

Xi-variational inference ($\Xi$-VI) extends naive mean-field VI by adding an expressivity penalty that discourages excessive factorization, with a tunable regularization parameter $\lambda$ that smoothly trades off statistical fidelity and computational efficiency. The inner coupling between variables is solved via entropic optimal transport, implemented through a multi-marginal Sinkhorn algorithm, yielding a posterior that interpolates between MFVI and the exact Bayes posterior. The authors establish frequentist guarantees, Bernstein–von Mises-type results, and high-dimensional asymptotics, providing regimes where $\Xi$-VI behaves like MFVI, Bayes-optimal inference, or an intermediate tempered posterior. They demonstrate practical gains on multivariate Gaussian, Bayesian linear regression with Laplace priors, and hierarchical eight-schools models, and discuss computational complexity, stability, and scalable strategies. Overall, $\Xi$-VI offers a principled, theory-grounded framework that bridges variational accuracy with tractable computation via entropic OT.

Abstract

Variational inference (VI) has emerged as a popular method for approximate inference for high-dimensional Bayesian models. In this paper, we propose a novel VI method that extends the naive mean field via entropic regularization, referred to as $\Xi$-variational inference ($\Xi$-VI). $\Xi$-VI has a close connection to the entropic optimal transport problem and benefits from the computationally efficient Sinkhorn algorithm. We show that $\Xi$-variational posteriors effectively recover the true posterior dependency, where the dependence is downweighted by the regularization parameter. We analyze the role of dimensionality of the parameter space on the accuracy of $\Xi$-variational approximation and how it affects computational considerations, providing a rough characterization of the statistical-computational trade-off in $\Xi$-VI. We also investigate the frequentist properties of $\Xi$-VI and establish results on consistency, asymptotic normality, high-dimensional asymptotics, and algorithmic stability. We provide sufficient criteria for achieving polynomial-time approximate inference using the method. Finally, we demonstrate the practical advantage of $\Xi$-VI over mean-field variational inference on simulated and real data.
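To make the Sinkhorn connection mentioned in the abstract concrete, here is a minimal sketch of the classical two-marginal Sinkhorn iteration for entropic optimal transport on a small discrete grid. It is an illustration only, not the authors' multi-marginal algorithm: the function name `sinkhorn`, the entropic weight `eps`, the grid, the squared-distance cost, and the marginals `a` and `b` are all assumptions chosen for the example, and the relationship between `eps` and the paper's $\lambda$ is not specified here.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, tol=1e-4, max_iter=10_000):
    """Two-marginal entropic OT by alternating row/column scaling.

    a, b : target marginals (nonnegative, each summing to 1)
    C    : cost matrix of shape (len(a), len(b))
    eps  : entropic regularization weight (hypothetical value)
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for it in range(1, max_iter + 1):
        u = a / (K @ v)           # rescale to match row marginals
        v = b / (K.T @ u)         # rescale to match column marginals
        P = u[:, None] * K * v[None, :]
        # After the v-update the column marginals are exact,
        # so the remaining error is measured on the rows.
        err = np.abs(P.sum(axis=1) - a).sum()
        if err < tol:
            break
    return P, it

# Toy problem: 5 grid points, squared-distance cost (illustrative data).
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

P, n_iter = sinkhorn(a, b, C)
print("iterations:", n_iter)
print("row marginal error:", np.abs(P.sum(axis=1) - a).sum())
```

The stopping rule mirrors the kind of "Sinkhorn error" threshold ($10^{-4}$) that the figure captions use to report runtime, though the exact error measure used in the paper may differ.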

Paper Structure

This paper contains 31 sections, 31 theorems, 246 equations, 6 figures, 1 table, 3 algorithms.

Key Result

Proposition 1

Suppose we solve the Gaussian $\Xi$-VI problem eqn-mGaussian-1 with $\mathcal{N}(\mu_0, \Sigma_0)$ the exact posterior and $\lambda > 0$. Then the minimizer is $\textrm{q}_\lambda^* = \mathcal{N}(\mu^*, \Sigma^*)$, where $\mu^*, \Sigma^*$ satisfy the following fixed point equations: For any matrix norm $\|\cdot\|$, the following bounds hold:

Figures (6)

  • Figure 1: $\Xi$-VI solutions for a bivariate Gaussian posterior for varying $\lambda$. The left panel illustrates the transition of the variational posterior $\textrm{q}^*_\lambda$ from closely approximating the exact posterior (at low $\lambda$) to resembling the mean-field approximation (at high $\lambda$). The right panel shows the covariance between the two normal coordinates versus $\lambda$ on a log scale. Note that the $\Xi$-variational approximation to the covariance is very accurate up to a critical $\lambda$ ($\approx 10^{-1}$), after which it degrades rapidly to $0$.
  • Figure 2: Left: accuracy of $\Xi$-VI for Laplace linear regression, measured in $W_2$ across values of $\lambda$. Right: runtime of $\Xi$-VI for Laplace linear regression, measured in the number of iterations to reduce the Sinkhorn error to $10^{-4}$, across values of $\lambda$.
  • Figure 3: Contour plots for the joint distribution of $\theta_1$ and $\theta_7$ across various variational approximations of the Eight School model. The subplots compare the exact posterior distribution with $\Xi$-variational posteriors for varying $\lambda$ values and with the MFVI approximation. Each subplot includes a linear regression line showing the fitted slope of $\theta_7$ on $\theta_1$.
  • Figure 4: Comparison of the 95% posterior credible intervals for the maximum and minimum treatment effects across schools in the Eight School model. The sequence from left to right includes the exact posterior, $\Xi$-VI with $\lambda \in \{0,1,10,1000\}$, MFVI, normalizing flow (NFVI), full-rank ADVI, and Stein variational gradient descent (SVGD).
  • Figure 5: Left: approximation accuracy for the Eight School model of $\Xi$-VI across varying $\lambda$ compared with other VI methods, measured in KL divergence and $W_2$ distance. Right: runtime for the Eight School model as a function of varying $\lambda$, measured in the number of iterations to reduce the Sinkhorn error to $10^{-4}$.
  • ...and 1 more figures
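
As a point of reference for the mean-field endpoint described in the Figure 1 caption, the sketch below computes the standard reverse-KL mean-field approximation to a bivariate Gaussian. It relies on the textbook result that the optimal factorized Gaussian keeps the exact mean and takes each variance to be the reciprocal of the corresponding diagonal entry of the precision matrix. The covariance matrix `Sigma` is a made-up example, not the one used in the paper, and the sketch does not implement $\Xi$-VI itself.

```python
import numpy as np

# Hypothetical "exact posterior" covariance with correlation 0.8.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Precision matrix of the exact posterior.
Lambda = np.linalg.inv(Sigma)

# Reverse-KL mean-field optimum: variances are 1 / Lambda_ii,
# and all cross-covariances are zero by construction.
mf_var = 1.0 / np.diag(Lambda)
Sigma_mf = np.diag(mf_var)

print("exact covariance:\n", Sigma)
print("mean-field covariance:\n", Sigma_mf)
```

With this `Sigma`, the mean-field variances shrink from 1 to 0.36 and the off-diagonal covariance drops from 0.8 to 0, which is the kind of lost dependence that the caption describes the variational posterior approaching at high $\lambda$.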

Theorems & Definitions (65)

  • Proposition 1
  • Theorem 1: Bernstein–von Mises Theorem
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Theorem 3
  • Corollary 3
  • Proposition 2: Altschuler2023
  • Remark 1
  • Remark 2
  • ...and 55 more