Towards understanding Accelerated Stein Variational Gradient Flow -- Analysis of Generalized Bilinear Kernels for Gaussian target distributions

Viktor Stein, Wuchen Li

TL;DR

The paper proposes ASVGD, an accelerated variant of SVGD that runs a momentum-based gradient flow on the density manifold and adds a Wasserstein metric regularization to stabilize the particle updates. For generalized bilinear kernels $K(x, y) = x^{\mathrm{T}} A y + 1$ and Gaussian targets, it proves that the dynamics preserve Gaussianity and derives an $A$-optimal, parameter-free convergence rate that depends on the condition number of the target covariance through $\sqrt{\kappa(Q)}$. It also identifies an optimal damping constant that is independent of the smallest eigenvalue of $Q$, which yields asymptotic convergence guarantees. In simulations, ASVGD outperforms SVGD and other samplers on toy problems and Bayesian neural networks. These results suggest that acceleration in density space can improve sampling efficiency for high-dimensional Bayesian inference tasks, including Bayesian neural networks.

Abstract

Stein variational gradient descent (SVGD) is a kernel-based, non-parametric particle method for sampling from a target distribution, as required in Bayesian inference and other machine learning tasks. Unlike other particle methods, SVGD does not require estimating the score, which is the gradient of the log-density. However, in practice, SVGD can be slow compared to score-estimation-based sampling algorithms. To design a fast and efficient high-dimensional sampling algorithm with the advantages of SVGD, we introduce accelerated SVGD (ASVGD), based on an accelerated gradient flow in a metric space of probability densities following Nesterov's method. We then derive a momentum-based discrete-time sampling algorithm, which evolves a set of particles deterministically. To stabilize the particles' position update, we also include a Wasserstein metric regularization. This paper extends the conference version \cite{SL2025}. For the bilinear kernel and Gaussian target distributions, we determine the kernel and damping parameters that yield the optimal convergence rate of the proposed dynamics. This is achieved by analyzing the linearized accelerated gradient flows at the equilibrium. Interestingly, the optimal parameter is a constant, which does not depend on the covariance of the target distribution. For more general kernel functions, such as the Gaussian kernel, numerical examples with varied target distributions demonstrate the effectiveness of ASVGD compared to SVGD and other popular sampling methods. Furthermore, we show that in the setting of Bayesian neural networks, ASVGD significantly outperforms SVGD in terms of log-likelihood and total iteration time.
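To make the setting concrete, the following is a minimal sketch of plain (non-accelerated) SVGD with the generalized bilinear kernel $K(x, y) = x^{\mathrm{T}} A y + 1$ for a centered Gaussian target with potential $f(x) = \frac{1}{2} x^{\mathrm{T}} Q x$. It is not the ASVGD algorithm of the paper: the momentum and the Wasserstein regularization are omitted. The function name `svgd_step_bilinear`, the step size, the particle count, and the number of iterations are illustrative assumptions; the values of $Q$ and the initial distribution follow the Figure 1 caption.

```python
import numpy as np

def svgd_step_bilinear(X, Q, A, step):
    """One explicit-Euler SVGD step for the target N(0, Q^{-1}).

    X has shape (n, d); Q is symmetric positive definite; A defines the
    kernel K(x, y) = x^T A y + 1. All names here are hypothetical.
    """
    n = X.shape[0]
    K = X @ A @ X.T + 1.0      # K[j, i] = x_j^T A x_i + 1
    score = -X @ Q             # row j is grad log pi(x_j) = -Q x_j (Q symmetric)
    drive = K.T @ score / n    # row i: (1/n) sum_j K(x_j, x_i) * score(x_j)
    repulse = X @ A.T          # row i: grad_{x_j} K(x_j, x_i) = A x_i for every j
    return X + step * (drive + repulse)

# Assumed experiment parameters (step size, n, iteration count are not from the paper).
rng = np.random.default_rng(0)
Q = np.array([[3.0, -2.0], [-2.0, 3.0]])
A = np.eye(2)
X = rng.multivariate_normal([1.0, 1.0], [[3.0, 2.0], [2.0, 3.0]], size=200)

for _ in range(500):
    X = svgd_step_bilinear(X, Q, A, step=0.05)

second_moment = X.T @ X / X.shape[0]   # empirical second moment of the particles
print(second_moment)                   # compare with np.linalg.inv(Q)
```

With this kernel the update is an affine map of each particle, whose coefficients depend only on the empirical mean and second moment, so a Gaussian particle cloud stays Gaussian. This mirrors, for plain SVGD, the Gaussianity-preservation property that the paper proves for the accelerated dynamics.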

Paper Structure

This paper contains 35 sections, 12 theorems, 91 equations, 3 figures, 4 tables, 4 algorithms.

Key Result

Lemma 4.1

Let $(\rho_t, \Phi_t)_{t > 0}$ solve eq:S_WS, $X_t \sim \rho_t$ and $Y_t \coloneqq \dot{X}_t$. For all $t > 0$ we have

Figures (3)

  • Figure 1: Particle trajectories of ASVGD and SVGD with the generalized bilinear kernel, MALA, and ULD (from left to right). The potential is $f(x) = \frac{1}{2} x^{\mathrm{T}} Q x$, with $Q = [[3, -2], [-2, 3]]$, and we initialize the particles from a Gaussian distribution with mean $[1, 1]^{\mathrm{T}}$ and covariance $[[3, 2], [2, 3]]$.
  • Figure 2: Monte-Carlo-estimated KL divergence for two different choices of $A$ for the particle evolutions from Figure 1. We see that for ASVGD, $A = \mathrm{id}_2$ performs better than aligning $A$ with the target.
  • Figure 3: Comparing ASVGD to other sampling algorithms. For the double bananas target, we choose a constant high damping $\beta = 0.985$. We draw the initial particles from unit normal distributions with means $[0, 5]^{\mathrm{T}}$, $[0, 7]^{\mathrm{T}}$, and $[0, 0]^{\mathrm{T}}$, respectively.
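As a small worked example for the Gaussian experiment of Figure 1, the sketch below computes the quantity $\sqrt{\kappa(Q)}$ that enters the convergence rate, using the matrix $Q$ from the caption. The variable names are illustrative assumptions. Since $Q$ has eigenvalues 1 and 5, its condition number is 5, and the initial covariance given in the caption turns out to be exactly 5 times the target covariance $Q^{-1}$.

```python
import numpy as np

# Matrix of the quadratic potential f(x) = 0.5 * x^T Q x from the Figure 1 caption.
Q = np.array([[3.0, -2.0], [-2.0, 3.0]])

eigs = np.linalg.eigvalsh(Q)       # eigenvalues of a symmetric matrix, ascending
kappa = eigs[-1] / eigs[0]         # condition number of Q
sqrt_kappa = np.sqrt(kappa)

target_cov = np.linalg.inv(Q)      # covariance of the target N(0, Q^{-1})
init_cov = np.array([[3.0, 2.0], [2.0, 3.0]])   # initial covariance from the caption
scale = init_cov @ Q               # equals 5 * identity if init_cov = 5 * Q^{-1}

print(eigs, kappa, sqrt_kappa)
print(target_cov)
print(scale)
```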

Theorems & Definitions (59)

  • Remark 2.1
  • Remark 2.2
  • Remark 2.3: Bregman geometry and accelerated mirror descent
  • Definition 3.1: (Co)tangent space to $\widetilde{\mathcal{P}}(\Omega)$
  • Definition 3.2: Metric tensor field $G$ on $\widetilde{\mathcal{P}}(\Omega)$
  • Remark 3.1: Onsager operator, mobility function
  • Example 3.1: Wasserstein-2 metric
  • Example 3.2: Stein metric
  • Definition 3.3: First linear functional derivatives
  • Definition 3.4: Metric gradient flow on $\widetilde{\mathcal{P}}(\Omega)$
  • ...and 49 more