Particle Semi-Implicit Variational Inference

Jen Ning Lim, Adam M. Johansen

TL;DR

Existing semi-implicit variational inference (SIVI) methods parameterize the mixing distribution implicitly, which makes the variational density intractable and rules out direct optimization of the ELBO. The paper addresses this by introducing Particle Variational Inference (PVI). PVI formulates a Euclidean–Wasserstein gradient flow for the pair $(\theta,r)\in \Theta\times\mathcal{P}(\mathbb{R}^{d_z})$, driven by a regularized free energy $\mathcal{E}_\lambda(\theta,r)$ whose minimizers correspond to optimal variational posteriors. Discretizing this flow with particles yields a practical algorithm that optimizes the ELBO directly and makes no parametric assumption about $r$. A theoretical analysis of the gradient flow of a related free energy functional establishes existence and uniqueness of solutions as well as propagation of chaos. Empirically, PVI compares favourably with prior SIVI methods on density estimation, Bayesian logistic regression, and Bayesian neural networks, and its learned mixing distributions make the variational family more expressive. The work thus offers a principled route to richer variational families in Bayesian inference, together with insight into the associated gradient-flow dynamics.
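
For orientation, the objects named above can be written out as follows. This is a sketch in assumed notation: the kernel symbol $k_\theta$ and the joint density $p(x,y)$ of latent variable $x$ and data $y$ are introduced here for illustration, and the regularizer in $\mathcal{E}_\lambda$ is deliberately left out because this summary does not state its form.

$$
q_{\theta,r}(x) = \int k_\theta(x \mid z)\, r(\mathrm{d}z),
\qquad
\mathcal{E}(\theta,r) = \int q_{\theta,r}(x)\,\log\frac{q_{\theta,r}(x)}{p(x,y)}\,\mathrm{d}x = -\,\mathrm{ELBO}(\theta,r).
$$

Under this reading, minimizing an unregularized free energy of this kind over $(\theta,r)$ is the same as maximizing the ELBO, and $\mathcal{E}_\lambda$ adds a penalty to it.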

Abstract

Semi-implicit variational inference (SIVI) enriches the expressiveness of variational families by utilizing a kernel and a mixing distribution to hierarchically define the variational distribution. Existing SIVI methods parameterize the mixing distribution using implicit distributions, leading to intractable variational densities. As a result, directly maximizing the evidence lower bound (ELBO) is not possible, so they resort to one of the following: optimizing bounds on the ELBO, employing costly inner-loop Markov chain Monte Carlo runs, or solving minimax objectives. In this paper, we propose a novel method for SIVI called Particle Variational Inference (PVI) which employs empirical measures to approximate the optimal mixing distributions characterized as the minimizer of a free energy functional. PVI arises naturally as a particle approximation of a Euclidean–Wasserstein gradient flow and, unlike prior works, it directly optimizes the ELBO whilst making no parametric assumption about the mixing distribution. Our empirical results demonstrate that PVI performs favourably compared to other SIVI methods across various tasks. Moreover, we provide a theoretical analysis of the behaviour of the gradient flow of a related free energy functional: establishing the existence and uniqueness of solutions as well as propagation of chaos results.
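
To make the hierarchical construction concrete, the following minimal sketch evaluates and samples a semi-implicit density when the mixing distribution is an empirical measure over particles, so that the density becomes a finite average of kernel evaluations. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel with an affine mean, the fixed `sigma`, and all function and parameter names (`gaussian_kernel_logpdf`, `semi_implicit_logpdf`, `sample_semi_implicit`, `theta["W"]`, `theta["b"]`) are hypothetical choices made here.

```python
import numpy as np


def gaussian_kernel_logpdf(x, z, theta, sigma=0.5):
    """log k_theta(x | z) for an illustrative Gaussian kernel.

    Assumption (not from the paper): the kernel mean is the affine map
    W z + b and the covariance is sigma^2 * I.
    x: (d_x,), z: (M, d_z), theta["W"]: (d_x, d_z), theta["b"]: (d_x,)
    """
    mean = z @ theta["W"].T + theta["b"]           # (M, d_x)
    d_x = mean.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)     # (M,)
    return -0.5 * sq_dist / sigma**2 - 0.5 * d_x * np.log(2 * np.pi * sigma**2)


def semi_implicit_logpdf(x, particles, theta, sigma=0.5):
    """log q(x) with q(x) = (1/M) * sum_m k_theta(x | z_m).

    The particles z_1..z_M play the role of an empirical mixing distribution.
    """
    log_k = gaussian_kernel_logpdf(x, particles, theta, sigma)
    m = log_k.max()                                # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_k - m)))


def sample_semi_implicit(n, particles, theta, rng, sigma=0.5):
    """Draw n samples: pick a particle uniformly, then sample the kernel."""
    idx = rng.integers(0, len(particles), size=n)
    mean = particles[idx] @ theta["W"].T + theta["b"]
    return mean + sigma * rng.standard_normal(mean.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = {"W": np.array([[1.0]]), "b": np.array([0.0])}
    particles = np.array([[-2.0], [2.0]])          # two particles, d_z = 1
    print(semi_implicit_logpdf(np.array([1.0]), particles, theta))
    print(sample_semi_implicit(5, particles, theta, rng).ravel())
```

Because the mixing distribution is represented by a finite set of particles, the density is an explicit mixture and can be evaluated directly, which is what allows an ELBO estimate to be formed without bounds, inner-loop MCMC, or a minimax objective. How the particles and `theta` are then updated is the subject of the paper's algorithm and is not reproduced here.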

Paper Structure

This paper contains 36 sections, 18 theorems, 114 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

Given a $\mathcal{Q}_{\mathtt{YuZ}}$-variational family of the form $\mathcal{Q}_{\mathtt{YuZ}}:= \mathcal{Q}(\mathcal{K}_{\mathcal{F}; \phi, p_{k}}, \mathcal{R}_{\mathcal{G}; p_{r}})$, there exist a $\mathcal{Q}_{\mathtt{YiZ}}$-variational family and a $\mathcal{Q}_{\mathtt{TR}}$-variational family such that $\mathcal{Q}_{\mathtt{YuZ}} = \mathcal{Q}_{\mathtt{YiZ}}= \mathcal{Q}_{\mathtt{TR}}$.

Figures (3)

  • Figure 1: Comparison of PVI and PVIZero on a bimodal mixture of Gaussians for various kernels. The plot shows the density $q_{\theta,r}$ from PVI and PVIZero as well as the KDE plot of $r$ from PVI described by $100$ particles.
  • Figure 2: Contour plots of the densities $q_{\theta,r}$ (in blue) against the true densities (in black) for various toy density estimation problems. We also plot the absolute difference in the density of $q_{\theta,r}$ and the true density, i.e., $|q_{\theta,r} - p|$.
  • Figure 3: Comparison between SIVI methods and MCMC on a Bayesian logistic regression problem. (a) shows the marginal and pairwise approximations of the posterior of the weights $x_1, x_2, x_3$, and (b) shows a scatter plot of the correlation coefficients from MCMC ($y$-axis) against those from PVI ($x$-axis).

Theorems & Definitions (37)

  • Proposition 1: $\mathcal{Q}_{\mathtt{YuZ}} = \mathcal{Q}_{\mathtt{YiZ}}= \mathcal{Q}_{\mathtt{TR}}$
  • Proposition 2
  • Proposition 3
  • Proposition 4: First Variation of $\mathcal{E}_\lambda$ and $\sf{R}^{\textrm{E}}_\lambda$
  • Proposition 5: Contracting Gradient Dynamics
  • Proposition 6
  • Proposition 7: $\Gamma$-convergence and convergence of minima
  • Proposition 8: Existence and Uniqueness
  • Proposition 9: Propagation of chaos
  • Definition B.1: $\Gamma$-convergence
  • ...and 27 more