Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Qiang Liu, Dilin Wang

TL;DR

This work introduces Stein Variational Gradient Descent (SVGD), a general-purpose, particle-based variational inference method that deterministically transports a set of particles toward a target posterior by performing functional gradient descent in an RKHS to minimize the KL divergence. It leverages kernelized Stein discrepancy to derive a closed-form, steepest-descent direction for updates, yielding updates that combine a smoothed gradient toward the posterior with a repulsive interaction to maintain particle diversity. The main contributions include a rigorous link between KL derivatives under smooth transforms and Stein’s identity, a practical SVGD algorithm that reduces to MAP with a single particle and scales to multi-particle Bayesian inference, and extensive experiments showing competitive performance on toy and real data, with favorable efficiency characteristics. The method offers a general, user-friendly variational tool capable of unnormalized-posterior inference and scales to large datasets via minibatching and kernel tricks, bridging optimization-style updates with probabilistic inference.

Abstract

We propose a general purpose variational inference algorithm that forms a natural counterpart of gradient descent for optimization. Our method iteratively transports a set of particles to match the target distribution, by applying a form of functional gradient descent that minimizes the KL divergence. Empirical studies are performed on various real world models and datasets, on which our method is competitive with existing state-of-the-art methods. The derivation of our method is based on a new theoretical result that connects the derivative of KL divergence under smooth transforms with Stein's identity and a recently proposed kernelized Stein discrepancy, which is of independent interest.
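
The particle transport described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: it assumes an RBF kernel $k(x,x') = \exp(-\|x-x'\|^2/h)$, a variant of the median heuristic for the bandwidth $h$, and a fixed step size. The function names `svgd_direction` and `svgd` are hypothetical.

```python
import numpy as np

def svgd_direction(x, grad_logp, h=None):
    """Empirical SVGD direction for particles x of shape (n, d).

    grad_logp maps an (n, d) array to the (n, d) array of score values
    grad_x log p(x); only the unnormalized target is needed.
    Assumption: RBF kernel k(x, x') = exp(-||x - x'||^2 / h).
    """
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]      # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)      # (n, n) squared distances
    if h is None:
        # Median-heuristic variant; log(n + 1) avoids dividing by log(1) = 0.
        h = np.median(sq_dist) / np.log(n + 1.0)
        if h <= 0:
            h = 1.0                           # fallback, e.g. a single particle
    K = np.exp(-sq_dist / h)                  # K[i, j] = k(x_i, x_j), symmetric
    # Smoothed gradient: sum_j k(x_j, x_i) * grad log p(x_j).
    drive = K @ grad_logp(x)
    # Repulsion: sum_j grad_{x_j} k(x_j, x_i) = (2 / h) * sum_j K[i, j] * (x_i - x_j).
    repulse = (2.0 / h) * np.sum(K[:, :, None] * diff, axis=1)
    return (drive + repulse) / n

def svgd(x0, grad_logp, n_iter=2000, step=0.1):
    """Run SVGD with a fixed step size (an assumption made for simplicity)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x = x + step * svgd_direction(x, grad_logp)
    return x

if __name__ == "__main__":
    # Standard normal target: grad log p(x) = -x. Particles start far to the left.
    x0 = np.linspace(-6.0, -4.0, 50)[:, None]
    particles = svgd(x0, lambda x: -x)
    print("mean:", particles.mean(), "std:", particles.std())
```

With a single particle the kernel matrix is `[[1.0]]` and the repulsive term vanishes, so the update is plain gradient ascent on $\log p$, which matches the reduction to MAP mentioned in the TL;DR.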

Paper Structure

This paper contains 16 sections, 3 theorems, 11 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

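Let ${\boldsymbol{T}}(x) = x+ \epsilon {\boldsymbol{\phi}}(x)$ and let $q_{[{\boldsymbol{T}}]}(z)$ be the density of $z = {\boldsymbol{T}}(x)$ when $x\sim q(x)$. Then $$\nabla_{\epsilon} \mathrm{KL}(q_{[{\boldsymbol{T}}]} ~\|~ p) \big|_{\epsilon=0} = -\mathbb{E}_{x\sim q}\big[\mathrm{trace}({\mathcal{A}}_p {\boldsymbol{\phi}}(x))\big],$$ where ${\mathcal{A}}_p {\boldsymbol{\phi}}(x) = \nabla_x \log p(x) {\boldsymbol{\phi}}(x)^\top + \nabla_{x} {\boldsymbol{\phi}}(x)$ is the Stein operator.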
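
To make the role of the Stein operator concrete, the block below expands its trace and writes out the empirical particle update that the paper's steepest-descent result (Lemma 3.2) leads to. It is a sketch restated from the standard SVGD formulation, not a replacement for the paper's derivation; $k$ is a positive definite kernel, $\{x_i\}_{i=1}^n$ are the particles, and $\epsilon$ is a step size.

```latex
% Trace of the Stein operator: a dot product with the score plus a divergence.
\mathrm{trace}\big(\mathcal{A}_p \boldsymbol{\phi}(x)\big)
  = \nabla_x \log p(x)^\top \boldsymbol{\phi}(x) + \nabla_x \cdot \boldsymbol{\phi}(x)

% Steepest-descent direction in the RKHS (unnormalized).
\boldsymbol{\phi}^*_{q,p}(\cdot)
  = \mathbb{E}_{x \sim q}\big[ k(x, \cdot)\, \nabla_x \log p(x) + \nabla_x k(x, \cdot) \big]

% Empirical update with n particles.
x_i \leftarrow x_i + \epsilon\, \hat{\boldsymbol{\phi}}^*(x_i), \qquad
\hat{\boldsymbol{\phi}}^*(x)
  = \frac{1}{n} \sum_{j=1}^{n}
    \big[ k(x_j, x)\, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x) \big]
```

The first term inside the sum is the kernel-smoothed gradient that pulls particles toward high-probability regions; the second is the repulsive term that keeps them apart.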

Figures (3)

  • Figure 1: Toy example with a 1D Gaussian mixture. The red dashed lines show the target density, and the solid green lines show the densities of the particles at different iterations of our algorithm (estimated with a kernel density estimator). Note that the initial distribution is set to have almost zero overlap with the target distribution, and our method demonstrates the ability to escape the local mode on the left and recover the mode on the right, which is farther away. We use $n=100$ particles.
  • Figure 2: We use the same setting as Figure 1, except that we vary the number $n$ of particles. (a)-(c) show the mean squared errors when using the obtained particles to estimate the expectation $\mathbb{E}_p(h(x))$ for $h(x)=x$, $x^2$, and $\cos(\omega x+ b)$; for $\cos(\omega x+ b)$, we randomly draw $\omega\sim \mathcal{N}(0,1)$ and $b\sim \mathrm{Uniform}([0,2\pi])$ and report the average MSE over $20$ random draws of $\omega$ and $b$.
  • Figure 3: Results for Bayesian logistic regression on the Covertype dataset w.r.t. epochs and the particle size $n$. We use $n=100$ particles for our method, parallel SGLD, and PMD, and average the last $100$ points for sequential SGLD. The "particle-based" methods (solid lines) in principle require 100 times as many likelihood evaluations per iteration as DSVI and sequential SGLD (dashed lines), but are implemented efficiently using Matlab matrix operations (e.g., each iteration of parallel SGLD is about 3 times slower than sequential SGLD). We partition the data into $80\%$ for training and $20\%$ for testing and average over 50 random trials. A mini-batch size of $50$ is used for all the algorithms.
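
The error metric in Figure 2 can be sketched as follows: for a Gaussian mixture target, the three test expectations have closed forms, so the squared error of a particle estimate can be computed exactly. The mixture parameters below are an assumption chosen for illustration (the caption does not state them), and the function names are hypothetical.

```python
import numpy as np

# Assumed 1D Gaussian mixture target (illustrative values, not from the caption).
WEIGHTS = np.array([1.0 / 3.0, 2.0 / 3.0])
MEANS = np.array([-2.0, 2.0])
STDS = np.array([1.0, 1.0])

def exact_expectations(omega, b):
    """Closed-form E_p[h(x)] for h(x) = x, x^2, cos(omega * x + b)."""
    e_x = np.sum(WEIGHTS * MEANS)
    e_x2 = np.sum(WEIGHTS * (MEANS ** 2 + STDS ** 2))
    # For x ~ N(mu, s^2): E[cos(w x + b)] = cos(w mu + b) * exp(-w^2 s^2 / 2).
    e_cos = np.sum(WEIGHTS * np.cos(omega * MEANS + b)
                   * np.exp(-0.5 * (omega * STDS) ** 2))
    return np.array([e_x, e_x2, e_cos])

def particle_estimates(x, omega, b):
    """Estimate the same three expectations by averaging over particles x."""
    x = np.asarray(x, dtype=float).ravel()
    return np.array([x.mean(), (x ** 2).mean(), np.cos(omega * x + b).mean()])

def squared_errors(x, omega, b):
    """Squared error of each particle estimate against the closed form."""
    return (particle_estimates(x, omega, b) - exact_expectations(omega, b)) ** 2
```

Averaging `squared_errors` over repeated runs gives the MSE for $h(x)=x$ and $x^2$; for the cosine test function, the caption additionally averages over random draws of $\omega$ and $b$.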

Theorems & Definitions (3)

  • Theorem 3.1
  • Lemma 3.2
  • Theorem 3.3