Robust Approximate Sampling via Stochastic Gradient Barker Dynamics

Lorenzo Mauri, Giacomo Zanella

TL;DR

This work extends Barker's robust MCMC proposal to stochastic-gradient settings, producing the stochastic-gradient Barker dynamics (SGBD). It analyzes bias caused by minibatch gradient noise and proposes a corrected estimator (c-SGBD) based on a normal-noise assumption to mitigate bias, plus an extreme variant (e-SGBD) for high-noise regimes. Empirical results across skewed, ill-conditioned, and high-dimensional Bayesian problems show SGBD to be more robust to hyperparameter choices and gradient heterogeneity than SGLD, with c-SGBD often improving accuracy and e-SGBD offering fast convergence. The approach provides a practical, robust alternative to SGLD for large-scale Bayesian inference with complex posteriors.

Abstract

Stochastic Gradient (SG) Markov Chain Monte Carlo (MCMC) algorithms are popular for Bayesian sampling in the presence of large datasets. However, they come with few theoretical guarantees, and assessing their empirical performance is non-trivial. In such a context, it is crucial to develop algorithms that are robust to the choice of hyperparameters and to gradient heterogeneity since, in practice, both the choice of step-size and the behaviour of the target gradients induce hard-to-control biases in the invariant distribution. In this work we introduce the stochastic gradient Barker dynamics (SGBD) algorithm, extending the recently developed Barker MCMC scheme, a robust alternative to Langevin-based sampling algorithms, to the stochastic gradient framework. We characterize the impact of stochastic gradients on the Barker transition mechanism and develop a bias-corrected version that, under suitable assumptions, eliminates the error due to the gradient noise in the proposal. We illustrate the performance on a number of high-dimensional examples, showing that SGBD is more robust to hyperparameter tuning and to irregular behavior of the target gradients than the popular stochastic gradient Langevin dynamics algorithm.
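
To make the transition mechanism described in the abstract concrete, here is a minimal sketch of one vanilla stochastic-gradient Barker update. It assumes the standard Barker proposal form: each coordinate draws a symmetric increment $z$ and moves by $+z$ with probability $1/(1+e^{-z\,\hat{\partial}_j g(\theta)})$, otherwise by $-z$, with no accept/reject step. The names `sgbd_step` and `noisy_grad`, the Gaussian increment, and the toy standard-normal target with additive gradient noise are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgbd_step(theta, stoch_grad, sigma, rng):
    """One vanilla stochastic-gradient Barker update (sketch, no accept/reject)."""
    # Symmetric increment for every coordinate (Gaussian choice is an assumption).
    z = rng.normal(0.0, sigma, size=theta.shape)
    # Noisy estimate of the gradient of the log-target at theta.
    g_hat = stoch_grad(theta)
    # Barker probability of moving by +z rather than -z, per coordinate.
    p_plus = 1.0 / (1.0 + np.exp(-z * g_hat))
    b = np.where(rng.random(theta.shape) < p_plus, 1.0, -1.0)
    return theta + b * z

# Toy usage (hypothetical): standard normal target, gradient -theta plus noise.
rng = np.random.default_rng(0)
noisy_grad = lambda th: -th + 0.5 * rng.normal(size=th.shape)

theta = np.zeros(2)
samples = np.empty((5000, 2))
for t in range(5000):
    theta = sgbd_step(theta, noisy_grad, sigma=0.5, rng=rng)
    samples[t] = theta
```

Because the gradient only enters through a bounded probability, a single large gradient value changes the direction of the move but never its size, which is one way to read the robustness claim above.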

Paper Structure (29 sections, 8 theorems, 51 equations, 17 figures, 6 algorithms)


Key Result

Proposition 1

Under Condition cond:symmetry we have

Figures (17)

  • Figure 1: Shrinkage effect and bias correction. Plot of $p({\partial}_j g(\theta), z)$ (black line; $p$) and Monte Carlo estimates of $\mathbb{E}[p(\hat{\partial}_j g(\theta), z)]$ (dotted blue line; $\textbf{E}\hat{p}$), and $\mathbb{E}[\tilde{p}(\hat{\partial}_j g(\theta), z)]$ (dashed dark blue line; $\textbf{E}\tilde{p}$) versus the proposed increment $z$; for a logistic regression example with real data (see supplement for more details). Vertical red lines indicate $-1.702/\tau_\theta$ and $1.702/\tau_\theta$.
  • Figure 2: Toy Example: Skew-Normal target with isotropic Gaussian noise. Left: relative bias of the mean versus the shape parameter (log scale). Right: invariant distributions of the samplers for $\alpha = 5$ (top) and $\alpha = 20$ (bottom). Red refers to vanilla Langevin-based schemes and blue to vanilla Barker-based schemes. Dotted (resp. dashed) lines are produced with $\sigma_1 = 0.1 \times \mathrm{sd}(\pi_\alpha)$ (resp. $\sigma_2 = 0.5 \times \mathrm{sd}(\pi_\alpha)$), where $\mathrm{sd}(\pi_\alpha)$ is the target standard deviation. The grey shaded area in the right plots represents the true target density.
  • Figure 3: Sepsis dataset example. Traceplots of two coordinates with different scales ($\theta_1$, on the left, has the smaller scale) under two step-size configurations: small $\sigma$ (top) and larger $\sigma$ (bottom). Red refers to v-SGLD and blue to v-SGBD. Black horizontal lines indicate the interval centered at the posterior mean with a width of two standard deviations.
  • Figure 4: Bayesian Probabilistic Matrix Factorization: Predictive Accuracy on the MovieLens dataset. rMSE of the sample (left) and MCMC (right) estimates. Red refers to SGLD and blue to SGBD. Lighter dotted lines refer to the vanilla implementations of the algorithms, darker dashed-dotted lines to their extreme variants.
  • Figure 5: ICA: log likelihood on the MEG dataset. Log likelihood produced by each sample (left) and by the MCMC estimates (right) on held-out data. Red refers to SGLD and blue to SGBD. For both algorithms, the vanilla (lighter dotted lines), corrected (medium scale dashed lines) and extreme (darker dotted-dashed lines) versions are displayed.
  • ...and 12 more figures
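
The shrinkage effect and bias correction named in Figure 1 can be checked numerically with a small Monte Carlo experiment. The sketch below assumes Gaussian gradient noise with standard deviation `tau` and uses the logistic-probit approximation $1/(1+e^{-x}) \approx \Phi(x/1.702)$, which suggests rescaling the exponent by $1.702/\sqrt{1.702^2 - z^2\tau^2}$ for $|z| < 1.702/\tau$ (the range marked by the vertical red lines in Figure 1). This form of the corrected probability is an assumption derived from that approximation, not a formula quoted from the paper, and the function names and numerical values are hypothetical.

```python
import numpy as np

C = 1.702  # constant of the logistic-probit approximation

def p_barker(grad, z):
    """Barker probability of moving by +z given a gradient value."""
    return 1.0 / (1.0 + np.exp(-z * grad))

def p_corrected(grad_hat, z, tau):
    """Assumed bias-corrected probability; only defined for |z| * tau < C."""
    scale = C / np.sqrt(C**2 - (z * tau) ** 2)
    return 1.0 / (1.0 + np.exp(-scale * z * grad_hat))

rng = np.random.default_rng(0)
grad, z, tau = 3.0, 0.5, 3.0                        # illustrative values, |z| * tau = 1.5 < C
grad_hat = grad + tau * rng.normal(size=200_000)    # noisy gradient draws

p_true = p_barker(grad, z)                          # probability with the exact gradient
e_p_hat = p_barker(grad_hat, z).mean()              # plain plug-in: shrinks toward 1/2
e_p_tilde = p_corrected(grad_hat, z, tau).mean()    # corrected: closer to p_true
print(p_true, e_p_hat, e_p_tilde)
```

In this setup the plug-in average sits between 1/2 and the exact probability, which is the shrinkage the figure title refers to, while the rescaled version recovers the exact value up to the error of the probit approximation.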

Theorems & Definitions (20)

  • Proposition 1: Direction of bias
  • Proposition 2
  • Remark 1
  • Corollary 1: Approximate unbiasedness of $\tilde{p}$
  • Remark 2
  • Remark 3
  • Proposition 3: Noise tolerance
  • Definition 1: Symmetric estimator
  • Proposition 4
  • Corollary 2
  • ...and 10 more