Adaptive Kernel Selection for Stein Variational Gradient Descent

Moritz Melcher, Simon Weissmann, Ashia C. Wilson, Jakob Zech

TL;DR

This work tackles the sensitivity of Stein Variational Gradient Descent to kernel choice by introducing Adaptive SVGD (Ad-SVGD), which adaptively tunes kernel parameters to maximize the kernelized Stein discrepancy during inference. The authors provide a mean-field convergence analysis for the adaptive framework and demonstrate that focusing on the worst-case KSD over a kernel class accelerates posterior transport and variance recovery. Empirically, Ad-SVGD outperforms the median-heuristic SVGD across several tasks, including high-dimensional inverse problems and Bayesian logistic regression, by better capturing posterior uncertainty. The approach offers a flexible pathway to enhancing SVGD robustness and practical performance in complex Bayesian inference problems.
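
For orientation, the quantities named above can be written out using the standard SVGD and KSD definitions. The notation here is an assumption and may differ from the paper's: $\pi$ is the target, $s = \nabla \log \pi$ its score, $k_\theta$ a kernel with parameter $\theta$ in a class $\Theta$, $\varepsilon$ a step size, and $\mu_n$ the particle distribution at iteration $n$.

$$
\begin{aligned}
x_i &\leftarrow x_i + \frac{\varepsilon}{N} \sum_{j=1}^{N} \Big[ k_\theta(x_j, x_i)\, s(x_j) + \nabla_{x_j} k_\theta(x_j, x_i) \Big], \\
\mathrm{KSD}_\theta^2(\mu \mid \pi) &= \mathbb{E}_{x, x' \sim \mu}\Big[ s(x)^\top s(x')\, k_\theta(x, x') + s(x)^\top \nabla_{x'} k_\theta(x, x') + s(x')^\top \nabla_{x} k_\theta(x, x') + \operatorname{tr}\big(\nabla_x \nabla_{x'} k_\theta(x, x')\big) \Big], \\
\theta_n &\in \operatorname*{arg\,max}_{\theta \in \Theta} \mathrm{KSD}_\theta(\mu_n \mid \pi).
\end{aligned}
$$

The first line is the usual particle update for a fixed kernel, the second is the squared kernelized Stein discrepancy, and the third expresses the idea of picking the kernel parameter that maximizes the KSD at the current iterate.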

Abstract

A central challenge in Bayesian inference is efficiently approximating posterior distributions. Stein Variational Gradient Descent (SVGD) is a popular variational inference method which transports a set of particles to approximate a target distribution. The SVGD dynamics are governed by a reproducing kernel Hilbert space (RKHS) and are highly sensitive to the choice of the kernel function, which directly influences both convergence and approximation quality. The commonly used median heuristic offers a simple approach for setting kernel bandwidths but lacks flexibility and often performs poorly, particularly in high-dimensional settings. In this work, we propose an alternative strategy for adaptively choosing kernel parameters over an abstract family of kernels. Recent convergence analyses based on the kernelized Stein discrepancy (KSD) suggest that optimizing the kernel parameters by maximizing the KSD can improve performance. Building on this insight, we introduce Adaptive SVGD (Ad-SVGD), a method that alternates between updating the particles via SVGD and adaptively tuning kernel bandwidths through gradient ascent on the KSD. We provide a simplified theoretical analysis that extends existing results on minimizing the KSD for fixed kernels to our adaptive setting, showing convergence properties for the maximal KSD over our kernel class. Our empirical results further support this intuition: Ad-SVGD consistently outperforms standard heuristics in a variety of tasks.
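
The alternation described above (an SVGD particle update followed by a gradient-ascent step on the KSD with respect to the bandwidth) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the target is a one-dimensional standard normal, the kernel class is the RBF family with a single bandwidth `h`, the KSD is estimated with an off-diagonal (U-statistic) sum, the bandwidth gradient is taken by central finite differences in `log h` rather than automatic differentiation, and `h` is clamped to a hypothetical range `[h_min, h_max]`. All function names and default values are made up for this example.

```python
import math
import random


def score(x):
    """Score of the illustrative target N(0, 1): d/dx log pi(x) = -x."""
    return -x


def ksd2(xs, h):
    """Off-diagonal (U-statistic) estimate of the squared KSD, RBF kernel."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = xs[i] - xs[j]
            k = math.exp(-d * d / (2.0 * h * h))
            dk_dx = -d / (h * h) * k                     # d/dx k(x, y)
            dk_dy = d / (h * h) * k                      # d/dy k(x, y)
            d2k = (1.0 / (h * h) - d * d / h ** 4) * k   # d^2/(dx dy) k(x, y)
            total += (score(xs[i]) * score(xs[j]) * k
                      + score(xs[i]) * dk_dy
                      + score(xs[j]) * dk_dx
                      + d2k)
    return total / (n * (n - 1))


def svgd_step(xs, h, step):
    """One SVGD update of all particles with a fixed bandwidth h."""
    n = len(xs)
    new = []
    for xi in xs:
        phi = 0.0
        for xj in xs:
            d = xj - xi
            k = math.exp(-d * d / (2.0 * h * h))
            # Attraction k * score(xj) plus repulsion d/dxj k(xj, xi).
            phi += k * score(xj) - d / (h * h) * k
        new.append(xi + step * phi / n)
    return new


def ad_svgd(xs, h=1.0, iters=200, step=0.05, lr=0.1,
            h_min=0.1, h_max=10.0, fd=1e-3):
    """Alternate a KSD-ascent step on log h with an SVGD particle step."""
    log_h = math.log(h)
    for _ in range(iters):
        # Finite-difference gradient of KSD^2 in log h (assumption: the
        # paper's method would use an exact gradient instead).
        up = ksd2(xs, math.exp(log_h + fd))
        down = ksd2(xs, math.exp(log_h - fd))
        log_h += lr * (up - down) / (2.0 * fd)
        # Keep the bandwidth inside the assumed compact range.
        log_h = min(max(log_h, math.log(h_min)), math.log(h_max))
        xs = svgd_step(xs, math.exp(log_h), step)
    return xs, math.exp(log_h)


random.seed(0)
init = [random.gauss(3.0, 0.5) for _ in range(30)]  # start far from the target
particles, h_final = ad_svgd(init)
print("mean:", sum(particles) / len(particles), "bandwidth:", h_final)
```

The off-diagonal estimator is used here because the diagonal terms of the full double sum contain a $1/h^2$ contribution that would reward shrinking the bandwidth regardless of the particle positions. The clamp plays the role of restricting the search to a bounded kernel class.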

Paper Structure

This paper contains 21 sections, 4 theorems, 39 equations, 15 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Suppose that Assumptions ass:target1-ass:kernel are satisfied. For any $\alpha > 1$ with there exists $c_\gamma>0$ such that for all $n\in\mathbb{N}$ where $(\mu_n)_{n\in\mathbb{N}}$ is generated by eq:ad-SVGD_exact.

Figures (15)

  • Figure 1: Final Wasserstein $1$-distances for one-dimensional example using SVGD with different fixed bandwidths $h$.
  • Figure 2: GP reconstruction for ODE-based inverse problem, showing mean and 90% confidence interval.
  • Figure 3: Aggregated results (mean and 95% confidence interval over 56 random seeds) for ODE-based inverse problem using Med-SVGD and Ad-SVGD.
  • Figure 4: Behavior of bandwidth parameter for ODE-based inverse problem using Ad-SVGD, aggregated over 56 random seeds.
  • Figure 5: Approximation quality of SVGD particles for Bayesian logistic regression measured by prediction accuracy (left) and $\mathrm{MMD}^2$ to reference samples (right).
  • ...and 10 more figures

Theorems & Definitions (6)

  • Lemma 1
  • Corollary 2
  • Theorem 3
  • proof
  • Theorem 4
  • proof