Stochastic variance-reduced Gaussian variational inference on the Bures-Wasserstein manifold

Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Marcelo Hartmann, Arto Klami

TL;DR

This work addresses the high variance of the forward Bures–Wasserstein (BW) gradient estimates used in forward–backward Euler optimization for Gaussian variational inference. It introduces SVRGVI, a variance-reduced estimator built on a Stein-style control variate $Z_k=\Sigma_k^{-1}(X_k-m_k)$ with an adaptive coefficient $c$. The estimator needs no extra samples and adds only $O(d^2)$ cost per iteration because it reuses the Cholesky factor. The authors prove variance reduction in a neighborhood of the optimal solution and, under strong convexity together with a trace condition, away from it; they also show that variance reduction improves the convergence bounds. Empirically, the method substantially outperforms BWGD and SGVI on Gaussian, Student’s t, and Bayesian logistic regression targets. The approach preserves the beneficial properties of the BW geometry while improving accuracy and stability by orders of magnitude, making BW-geometry Gaussian VI practical for high-dimensional Bayesian inference.
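
The control-variate idea can be illustrated with a minimal NumPy sketch. It assumes the estimator has the generic form $\nabla V(X_k) - c\,Z_k$ with a fixed $c$; the exact estimator and the adaptive rule for $c$ are not given in this summary, so this form and the helper names `sample_estimators`, `total_variance`, and `gaussian_grad` are hypothetical. Since $X_k = m_k + L_k\xi$ with $\Sigma_k = L_k L_k^\top$, the control variate reduces to $Z_k = L_k^{-\top}\xi$, which is how the Cholesky factor can be reused.

```python
import numpy as np


def sample_estimators(grad_V, m, L, c, n_samples, rng):
    """Draw single-sample estimates of E_mu[grad V] for mu = N(m, L @ L.T).

    Returns (plain Monte Carlo samples, control-variate samples).
    Assumed form of the variance-reduced estimator: grad_V(X) - c * Z.
    """
    d = m.shape[0]
    xi = rng.standard_normal((n_samples, d))
    X = m + xi @ L.T                      # rows are X = m + L xi
    # Z = Sigma^{-1}(X - m) = L^{-T} xi, so only a solve with L^T is needed.
    # np.linalg.solve is used for portability; a dedicated triangular
    # solver would bring this to O(d^2) per sample.
    Z = np.linalg.solve(L.T, xi.T).T
    G = grad_V(X)
    return G, G - c * Z


def total_variance(samples):
    """Sum of per-coordinate empirical variances (trace of the covariance)."""
    return float(samples.var(axis=0).sum())


def gaussian_grad(m_star, Sigma_star):
    """Gradient of V(x) = 0.5 (x - m*)^T Sigma*^{-1} (x - m*), applied row-wise."""
    P = np.linalg.inv(Sigma_star)
    return lambda X: (X - m_star) @ P


# Toy setting: Gaussian target, variational distribution close to it.
rng = np.random.default_rng(0)
d = 5
grad_V = gaussian_grad(np.zeros(d), np.eye(d))
m = 0.1 * np.ones(d)
L = np.linalg.cholesky(1.1 * np.eye(d))
mc, cv = sample_estimators(grad_V, m, L, c=0.9, n_samples=1000, rng=rng)
print("Monte Carlo variance:    ", total_variance(mc))
print("Control-variate variance:", total_variance(cv))
```

Setting `c=0` recovers the plain Monte Carlo estimator. Because $Z_k$ has zero mean under $\mu_k$, subtracting it leaves the expectation unchanged for any fixed $c$.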

Abstract

Optimization in the Bures-Wasserstein space has been gaining popularity in the machine learning community since it draws connections between variational inference and Wasserstein gradient flows. The variational inference objective, the Kullback-Leibler divergence, can be written as the sum of the negative entropy and the potential energy, making forward-backward Euler the method of choice. Notably, the backward step admits a closed-form solution in this case, which makes the scheme practical. However, the forward step is not exact since the Bures-Wasserstein gradient of the potential energy involves "intractable" expectations. Recent approaches propose using the Monte Carlo method (in practice, a single-sample estimator) to approximate these terms, resulting in high variance and poor performance. We propose a novel variance-reduced estimator based on the principle of control variates. We theoretically show that this estimator has a smaller variance than the Monte Carlo estimator in scenarios of interest. We also prove that variance reduction helps improve the optimization bounds of the current analysis. We demonstrate that the proposed estimator achieves order-of-magnitude improvements over the previous Bures-Wasserstein methods.
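
The closed-form backward step is not written out in this summary. As a hedged illustration, the sketch below uses the update commonly stated for the entropy proximal step on Gaussian covariances, $\Sigma^{+} = \tfrac{1}{2}\big(\Sigma + 2\eta I + [\Sigma(\Sigma + 4\eta I)]^{1/2}\big)$, where $\eta$ is the step size. Both this formula and the function name `backward_step` are assumptions and may differ from the paper's notation. Since $\Sigma$ and $\Sigma + 4\eta I$ share eigenvectors, the update acts on each eigenvalue separately.

```python
import numpy as np


def backward_step(Sigma_half, eta):
    """Assumed closed-form entropy proximal step on an SPD covariance.

    Each eigenvalue lam is mapped to
        0.5 * (lam + 2*eta + sqrt(lam * (lam + 4*eta))),
    which equals 0.5 * (S + 2*eta*I + (S (S + 4*eta*I))^{1/2}) for S = Sigma_half.
    """
    lam, U = np.linalg.eigh(Sigma_half)   # Sigma_half = U diag(lam) U^T
    new_lam = 0.5 * (lam + 2.0 * eta + np.sqrt(lam * (lam + 4.0 * eta)))
    return (U * new_lam) @ U.T            # U diag(new_lam) U^T


# One step on a small random covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma_half = A @ A.T + np.eye(3)
Sigma_next = backward_step(Sigma_half, eta=0.1)
print(np.linalg.eigvalsh(Sigma_half))
print(np.linalg.eigvalsh(Sigma_next))
```

Under this formula every eigenvalue grows by at least $\eta$, consistent with a step that decreases the negative entropy; the mean is left unchanged by this step.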

Paper Structure

This paper contains 29 sections, 6 theorems, 79 equations, 12 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

Assume that $V$ is continuously differentiable. Let $\mu = \mathcal{N}(m, \Sigma) \in \mathop{\mathrm{BW}}\nolimits(\mathbb{R}^d)$. Then,

Figures (12)

  • Figure 1: Left: Optimization trajectories of our method compared to SGVI (Diao et al., 2023) and BWGD (Lambert et al., 2022). The target is a $50$-dimensional Gaussian distribution, visualized here via the marginal distributions of the first two coordinates. Each ellipse represents a contour of a Gaussian: the black is the initial distribution, the red is the target, and the greys are intermediate steps. Our method is dramatically more stable and finds a more accurate final approximation. Right: the corresponding KL divergence, confirming our method is orders of magnitude more accurate.
  • Figure 2: Left: $\pi$ is a Gaussian, VI distribution $\mu$ is in the neighborhood of $\pi$. In this case, the true gradient, i.e., the expectation $\mathbb{E}_{\mu}\nabla V$, can be computed exactly (in navy blue). Our proposed estimator with $c=0.9$ (light blue) has a smaller variance than the Monte Carlo estimator (grey). These are 1,000 samples for each estimator, generated by drawing from $\mu$ and substituting the values into the respective estimator formulas. Right: The empirical variance of our proposed estimator when $c$ varies from $0$ to $2$. Note that $c=0$ corresponds to the Monte Carlo estimator.
  • Figure 3: KL divergence for Gaussian targets of varying dimensionality.
  • Figure 4: Performance of algorithms for Student's t target and Bayesian logistic regression.
  • Figure 5: Gaussian experiment: variance along iterations.
  • ...and 7 more figures

Theorems & Definitions (9)

  • Lemma 1
  • Remark 1
  • Theorem 1: Variance reduction around the optimal solution
  • Theorem 2: Variance reduction at large-variance distributions
  • Remark 2
  • Theorem 3: Convex case
  • Theorem 4: Strongly convex case
  • Remark 3
  • Lemma 2: Stein's lemma