
ELBOing Stein: Variational Bayes with Stein Mixture Inference

Ola Rønning, Eric Nalisnick, Christophe Ley, Padhraic Smyth, Thomas Hamelryck

TL;DR

This paper tackles variance collapse in Stein variational methods by introducing Stein Mixture Inference (SMI), which represents the variational posterior as a uniform mixture of $m$ guides parameterized by particles. SMI optimizes a mixture ELBO, $\text{ELBO}_{\text{SMI}}$, augmented by a diversification term that lets the particles spread; the objective remains an ELBO when the entropic coefficient is set to $1$. By embedding Nonlinear SVGD (NSVGD) within this mixture framework, the authors derive a tractable kernelized gradient that attracts particles toward high-likelihood regions while a repulsive force prevents collapse, enabling efficient uncertainty quantification with fewer particles. Empirically, SMI mitigates variance collapse in small-to-moderate Bayesian neural networks, yielding better-calibrated uncertainty on synthetic tasks, UCI benchmarks, and MNIST classification, and is more particle-efficient than SVGD. The work establishes a principled, ELBO-based pathway to variational Bayes with mixtures, combining the strengths of density-based and sample-based particle methods for tall-and-wide data scenarios.
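To make the "uniform mixture of $m$ guides" concrete, the following is a minimal sketch of a Monte Carlo estimate of the standard ELBO for a uniform mixture of 1D Gaussian guides. It is an illustration only, not the paper's $\text{ELBO}_{\text{SMI}}$ (which additionally carries the entropic coefficient). The function names, the shared guide standard deviation `std`, and the toy target are all assumptions made for this example.

```python
import math
import random


def log_normal(x, mean, std):
    """Log density of a 1D Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_mixture(x, particles, std):
    """Log density of the uniform mixture (1/m) * sum_l N(x; psi_l, std^2)."""
    logs = [log_normal(x, psi, std) for psi in particles]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(particles))


def mixture_elbo(log_joint, particles, std, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_q[log p(D, theta) - log q(theta)] for the mixture q.

    Each guide is sampled equally often, which matches the uniform weights.
    """
    rng = random.Random(seed)
    total = 0.0
    for psi in particles:
        for _ in range(n_samples):
            theta = rng.gauss(psi, std)
            total += log_joint(theta) - log_mixture(theta, particles, std)
    return total / (len(particles) * n_samples)


def toy_log_joint(theta):
    """Toy target (an assumption): the 'joint' is a normalized N(0, 1), so log evidence is 0."""
    return log_normal(theta, 0.0, 1.0)


elbo_exact = mixture_elbo(toy_log_joint, [0.0], 1.0)         # guide equals the target
elbo_poor = mixture_elbo(toy_log_joint, [-2.0, 2.0], 0.5)    # two narrow, misplaced guides
```

In this toy setting the log evidence is zero, so the estimate is zero when a single guide matches the target and negative for the misplaced two-component mixture, as expected of a lower bound.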

Abstract

Stein variational gradient descent (SVGD) [Liu and Wang, 2016] performs approximate Bayesian inference by representing the posterior with a set of particles. However, SVGD suffers from variance collapse, i.e. poor predictions due to underestimating uncertainty [Ba et al., 2021], even for moderately-dimensional models such as small Bayesian neural networks (BNNs). To address this issue, we generalize SVGD by letting each particle parameterize a component distribution in a mixture model. Our method, Stein Mixture Inference (SMI), optimizes a lower bound to the evidence (ELBO) and introduces user-specified guides parameterized by particles. SMI extends the Nonlinear SVGD framework [Wang and Liu, 2019] to the case of variational Bayes. SMI effectively avoids variance collapse, judging by a previously described test developed for this purpose, and performs well on standard data sets. In addition, SMI requires considerably fewer particles than SVGD to accurately estimate uncertainty for small BNNs. The synergistic combination of NSVGD, ELBO optimization and user-specified guides establishes a promising approach towards variational Bayesian inference in the case of tall and wide data.
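For readers unfamiliar with the baseline, here is a minimal 1D sketch of the standard SVGD particle update as it is commonly written, with an RBF kernel. This is the baseline method, not SMI. The fixed bandwidth `h`, the step size, the initial particle positions, and the standard-normal target are assumptions chosen for illustration.

```python
import math


def rbf(x, y, h):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))


def svgd_step(particles, grad_log_p, h=1.0, step=0.1):
    """One SVGD update: each particle moves along the kernelized Stein direction."""
    m = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            attract = k * grad_log_p(xj)       # pulls toward high-density regions
            repel = -(xj - xi) / (h * h) * k   # d k(xj, xi) / d xj, pushes xi away from xj
            phi += attract + repel
        updated.append(xi + step * phi / m)
    return updated


def run_svgd(particles, grad_log_p, n_steps=1000, h=1.0, step=0.1):
    """Apply svgd_step repeatedly and return the final particle positions."""
    for _ in range(n_steps):
        particles = svgd_step(particles, grad_log_p, h, step)
    return particles


# Target: standard normal, so grad log p(x) = -x. Particles start far from the mode.
final = run_svgd([3.0, 3.5, 4.0, 4.5, 5.0], lambda x: -x)
```

The attractive term alone would move every particle to the mode; the kernel-gradient term is what keeps them apart. SMI replaces each point particle with the parameters of a guide distribution.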

Paper Structure

This paper contains 55 sections, 1 theorem, 34 equations, 8 figures, 8 tables.

Key Result

Theorem 3.1

(The Kernelized Steepest Perturbation [Wang and Liu, 2019]) Let $F[\rho] + \alpha \mathbb{H}[\rho]$ be the variational objective for a transport $T(\rho)=\rho + \epsilon\phi[\rho]$, with $\epsilon>0$ and distribution $\rho$ with $\mathop{\mathrm{supp}}\nolimits \rho \subseteq \mathop{\mathrm{dom}}\nolimit

Figures (8)

  • Figure 1: Variational inference with SVGD-derived particles [Liu and Wang, 2016] versus with an SMI-derived probability density, formulated as a mixture model (this work). Left: SVGD uses $m$ particles ${\pmb{\theta}}_\ell$ to approximate the posterior $p({\pmb{\theta}}|\mathcal{D})$. Right: SMI uses a mixture model (with uniform weights) of $m$ guides $q({\pmb{\theta}}|{\pmb{\psi}}_\ell)$, parameterized by particles ${\pmb{\psi}}_\ell$, to approximate $p({\pmb{\theta}}|\mathcal{D})$. As a result, SMI approximates a Bayesian posterior with a richer model that alleviates variance collapse in higher-dimensional posteriors.
  • Figure 2: Left and middle (zoom): Variance estimation of a standard multivariate Gaussian obtained with 20-particle ASVGD, SMI and SVGD. Only SMI does not collapse and is robust to changing $\alpha$. Right: SMI with one particle and Gaussian guide exactly recovers the multivariate Gaussian.
  • Figure 3: Top row: High-density interval (HDI) for the low-dimensional model inferred using SMI, SVGD, ASVGD, OVI and NUTS on the 1D wave dataset (dotted line). SVGD, ASVGD, and SMI use five particles. The posteriors are inferred with data drawn from the In region, highlighted with vertical lines. NUTS serves as a reference. Bottom row: HDI for the moderate-dimensional model. ASVGD and SVGD display collapse by a significant narrowing in HDI between the In regions when comparing the low to moderate dimensions. On the other hand, both OVI and SMI widen the HDI with the richer model for the in-between region. In contrast to SMI, OVI overestimates the variance in the In region, where data is available, for the mid-sized model.
  • Figure 4: Left: Frobenius distance between the estimated and the true covariance matrix in the Gaussian variance estimation experiment, using 20 particles for all methods. Only SMI achieves distances close to zero, indicating that it accurately captures the shape of the standard Gaussian, unlike the other methods. Right: Frobenius distance when SMI uses a single particle. In this case, SMI perfectly recovers the posterior.
  • Figure 5: Mean location estimates of a standard Gaussian distribution across different dimensionalities and repulsion scaling ($\alpha$) for SMI (with 1 and 20 particles), ASVGD, SVGD and RESVGD (with 20 particles). RESVGD repulsion is not scaled. The "Actual" line represents the true mean location (zero). Only RESVGD exhibits significant bias, particularly in higher dimensions.
  • ...and 3 more figures
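
The Frobenius-distance metric in the Figure 4 caption can be sketched as follows: estimate a covariance matrix from particle positions and measure its Frobenius distance to the true covariance (the identity, for a standard Gaussian). The helper names and the choice of the $1/n$ covariance normalization are assumptions; the caption does not say which estimator the paper uses.

```python
import math


def sample_covariance(particles):
    """Covariance of a list of d-dimensional points, normalized by n (an assumption)."""
    n = len(particles)
    d = len(particles[0])
    mean = [sum(p[k] for p in particles) / n for k in range(d)]
    return [
        [sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in particles) / n for b in range(d)]
        for a in range(d)
    ]


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(
        sum((a[i][j] - b[i][j]) ** 2 for i in range(len(a)) for j in range(len(a[0])))
    )


identity = [[1.0, 0.0], [0.0, 1.0]]

# Particles that have collapsed onto one point carry no variance at all.
collapsed = [[0.2, -0.1]] * 4
dist_collapsed = frobenius_distance(sample_covariance(collapsed), identity)

# Particles that are spread out give a smaller distance to the true covariance.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
dist_spread = frobenius_distance(sample_covariance(spread), identity)
```

A fully collapsed particle set yields a zero covariance estimate, so its distance to the $d$-dimensional identity is $\sqrt{d}$; values near zero indicate that the estimated covariance matches the target.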

Theorems & Definitions (1)

  • Theorem 3.1