Adaptive Kernel Selection for Stein Variational Gradient Descent
Moritz Melcher, Simon Weissmann, Ashia C. Wilson, Jakob Zech
TL;DR
This work tackles the sensitivity of Stein Variational Gradient Descent to kernel choice by introducing Adaptive SVGD (Ad-SVGD), which adaptively tunes kernel parameters to maximize the kernelized Stein discrepancy during inference. The authors provide a mean-field convergence analysis for the adaptive framework and demonstrate that focusing on the worst-case KSD over a kernel class accelerates posterior transport and variance recovery. Empirically, Ad-SVGD outperforms the median-heuristic SVGD across several tasks, including high-dimensional inverse problems and Bayesian logistic regression, by better capturing posterior uncertainty. The approach offers a flexible pathway to enhancing SVGD robustness and practical performance in complex Bayesian inference problems.
Abstract
A central challenge in Bayesian inference is efficiently approximating posterior distributions. Stein Variational Gradient Descent (SVGD) is a popular variational inference method that transports a set of particles to approximate a target distribution. The SVGD dynamics are governed by a reproducing kernel Hilbert space (RKHS) and are highly sensitive to the choice of the kernel function, which directly influences both convergence and approximation quality. The commonly used median heuristic offers a simple approach for setting kernel bandwidths but lacks flexibility and often performs poorly, particularly in high-dimensional settings. In this work, we propose an alternative strategy for adaptively choosing kernel parameters over an abstract family of kernels. Recent convergence analyses based on the kernelized Stein discrepancy (KSD) suggest that optimizing the kernel parameters by maximizing the KSD can improve performance. Building on this insight, we introduce Adaptive SVGD (Ad-SVGD), a method that alternates between updating the particles via SVGD and adaptively tuning kernel bandwidths through gradient ascent on the KSD. We provide a simplified theoretical analysis that extends existing results on minimizing the KSD for fixed kernels to our adaptive setting, showing convergence properties for the maximal KSD over our kernel class. Our empirical results further support this intuition: Ad-SVGD consistently outperforms standard heuristics across a variety of tasks.
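The alternating scheme described in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation. It assumes a Gaussian target, a single-bandwidth RBF kernel, a U-statistic estimate of the squared KSD, a central finite difference in place of an exact gradient, a clipped ascent step on the log-bandwidth, and a bounded bandwidth interval. All names (`MU`, `SIGMA`, `score`, `svgd_direction`, `ksd_sq`, `ad_svgd`) and all numerical settings are hypothetical choices made for illustration.

```python
import math

MU, SIGMA = 2.0, 1.0  # hypothetical 1D Gaussian target N(MU, SIGMA^2)


def score(x):
    """Score of the target: d/dx log p(x)."""
    return -(x - MU) / SIGMA**2


def svgd_direction(xs, h):
    """Empirical SVGD direction at each particle for an RBF kernel of bandwidth h."""
    n = len(xs)
    phi = []
    for xi in xs:
        total = 0.0
        for xj in xs:
            k = math.exp(-((xj - xi) ** 2) / (2 * h * h))
            # Attraction k(x_j, x_i) * score(x_j) plus repulsion d/dx_j k(x_j, x_i).
            total += k * score(xj) - (xj - xi) / (h * h) * k
        phi.append(total / n)
    return phi


def ksd_sq(xs, h):
    """U-statistic estimate of the squared KSD (diagonal terms excluded)."""
    n = len(xs)
    total = 0.0
    for i, xi in enumerate(xs):
        for j, xj in enumerate(xs):
            if i == j:
                continue
            d = xi - xj
            k = math.exp(-d * d / (2 * h * h))
            si, sj = score(xi), score(xj)
            # Stein kernel: s_i s_j k + s_i d_y k + s_j d_x k + d_x d_y k.
            total += k * (si * sj + (si - sj) * d / h**2 + 1 / h**2 - d * d / h**4)
    return total / (n * (n - 1))


def ad_svgd(xs, h=1.0, steps=500, step_size=0.1, kernel_lr=0.05,
            h_min=0.1, h_max=10.0, eps=1e-4):
    """Alternate a bandwidth ascent step on the KSD with an SVGD particle step."""
    xs = list(xs)
    log_h = math.log(h)
    for _ in range(steps):
        # Kernel step: finite-difference gradient of KSD^2 w.r.t. log h (assumption:
        # stands in for an exact or automatic-differentiation gradient).
        grad = (ksd_sq(xs, math.exp(log_h + eps))
                - ksd_sq(xs, math.exp(log_h - eps))) / (2 * eps)
        # Clipped ascent step, then projection onto the bounded kernel class.
        log_h += kernel_lr * max(-1.0, min(1.0, grad))
        log_h = max(math.log(h_min), min(math.log(h_max), log_h))
        # Particle step: standard SVGD update with the current bandwidth.
        phi = svgd_direction(xs, math.exp(log_h))
        xs = [x + step_size * p for x, p in zip(xs, phi)]
    return xs, math.exp(log_h)


# Start from 20 evenly spaced particles on [-1, 1], away from the target mean.
particles = [-1.0 + 2.0 * i / 19 for i in range(20)]
final, bandwidth = ad_svgd(particles)
mean = sum(final) / len(final)
var = sum((x - mean) ** 2 for x in final) / len(final)
print(f"mean={mean:.3f} var={var:.3f} bandwidth={bandwidth:.3f}")
```

The bandwidth is updated in log-space so that it stays positive, and the interval `[h_min, h_max]` plays the role of a bounded kernel class in this toy setting. Excluding the diagonal terms from the KSD estimate is a choice made here to avoid the estimate growing without bound as the bandwidth shrinks. Other estimators or constraints may be preferable in practice.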
