Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation

Zier Mensch, Lars Holdijk, Samuel Duffield, Maxwell Aifer, Patrick J. Coles, Max Welling, Miranda C. N. Cheng

TL;DR

Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.

Abstract

Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic Gradient Langevin Dynamics (SGLD), SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance; this yields greater robustness to minibatch size while retaining asymptotic correctness. Furthermore, as a comparison, we analyze a natural analogue of SGLD that uses gradient clipping. Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.
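
To make the contrast between the two discretisations concrete, the following is a minimal Python sketch, not the paper's reference implementation. The SGLD step is the standard Euler-Maruyama update driven by a stochastic gradient. The lattice step assumes a binary lattice random walk in which each coordinate moves by exactly $\pm\sqrt{2\delta_t}$ and the gradient only biases the direction of the move. That specific form, the clipping of the probability to $[0, 1]$, and the names `sgld_step` and `lattice_step` are assumptions made for illustration and may differ from Definition 4.1.

```python
import math
import random


def sgld_step(theta, grad, dt, rng):
    """One SGLD step per coordinate: theta - dt * grad + sqrt(2 * dt) * N(0, 1)."""
    return [t - dt * g + math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)
            for t, g in zip(theta, grad)]


def lattice_step(theta, grad, dt, rng):
    """One binary lattice random walk step (assumed form, see the lead-in).

    Each coordinate moves by exactly +/- sqrt(2 * dt). The gradient only
    biases the direction: P(up) = 0.5 * (1 - grad * sqrt(dt / 2)),
    clipped to [0, 1], so the expected move is -dt * grad when unclipped.
    """
    h = math.sqrt(2.0 * dt)  # lattice spacing
    out = []
    for t, g in zip(theta, grad):
        p_up = min(1.0, max(0.0, 0.5 * (1.0 - g * math.sqrt(dt / 2.0))))
        out.append(t + h if rng.random() < p_up else t - h)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0, 1.0]
    noisy_grad = [250.0, -3.0]  # hypothetical stochastic gradient with one outlier
    print("SGLD   :", sgld_step(theta, noisy_grad, 0.01, rng))
    print("lattice:", lattice_step(theta, noisy_grad, 0.01, rng))
```

Under this assumed form, a single lattice step can never exceed $\sqrt{2\delta_t}$ in magnitude, however large the stochastic gradient is, whereas the SGLD step scales linearly with it.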

Paper Structure

This paper contains 45 sections, 4 theorems, 78 equations, 11 figures, 2 tables.

Key Result

Theorem 4.2

Under Assumption ass:lyap, the MSE is bounded by three contributions, for some constant $C$ that depends on the target distribution. The covariance error term for SGLRW is never larger than that of SGLD, while the other two contributions are the same for both methods. Moreover, the SGLRW covariance error term is strictly smaller whenever $2\partial_iU\zeta_i +\zeta_i^2$ is non-vanishing for some direction $i$.
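
One heuristic way to see where the quantity $2\partial_iU\zeta_i+\zeta_i^2$ can arise is to compare the per-coordinate second moment of a single update. The calculation below assumes a noisy gradient $\partial_iU+\zeta_i$, the standard SGLD update, and a binary lattice update whose step is exactly $\pm\sqrt{2\delta_t}$. The symbols $\Delta_i$ (one-step displacement) and $\xi_i$ (Gaussian noise) are introduced here for illustration. This is a sketch under those assumptions, not the argument used in the paper's proof.

```latex
\begin{align*}
\text{SGLD:}\quad
  \Delta_i &= -\delta_t\,(\partial_i U + \zeta_i) + \sqrt{2\delta_t}\,\xi_i,
  \qquad \xi_i \sim \mathcal{N}(0, 1), \\
  \mathbb{E}\big[\Delta_i^2 \mid \zeta\big]
  &= 2\delta_t + \delta_t^2\,(\partial_i U)^2
     + \delta_t^2\,\big(2\,\partial_i U\,\zeta_i + \zeta_i^2\big), \\[4pt]
\text{binary lattice:}\quad
  \Delta_i &= \pm\sqrt{2\delta_t}
  \quad\Longrightarrow\quad
  \Delta_i^2 = 2\delta_t \ \text{exactly, for any } \zeta_i .
\end{align*}
```

In this reading, gradient noise inflates the diagonal second moment of the SGLD update by $\delta_t^2\,(2\partial_iU\zeta_i+\zeta_i^2)$, while the assumed lattice update leaves the diagonal untouched. This is consistent with the abstract's statement that SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance.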

Figures (11)

  • Figure 1: Comparison of the SGLD (left) and SGLRW (right) discretisations of Langevin dynamics. The lattice-based discretisation suppresses the large parameter jumps caused by minibatch noise, resulting in more stable sampling.
  • Figure 2: Multimodal univariate target with exact gradient corrupted by synthetic $\alpha$-stable noise ($\alpha = 1.5$) of increasing scale. As the noise scale increases, SGLD quickly fails while SGLRW remains stable.
  • Figure 3: Mean-squared error (MSE) of the posterior covariance as a function of the step size $\delta_t$, shown for different batch sizes for $50$-dimensional Bayesian linear regression.
  • Figure 4: Covariance difference matrices $\Sigma_{\text{est}}-\Sigma_{\text{true}}$ for Bayesian linear regression at step size $\delta_t=10^{-3}$, shown across increasing minibatch sizes $B$. Top: clipped SGLD. Middle: SGLD. Bottom: SGLRW. Each panel visualizes the deviation of the empirical posterior covariance from the analytic posterior covariance; the Frobenius norm (Frob) reports the total error magnitude. We observe that the error in the diagonal terms of the estimated covariance matrices is lower for SGLRW than for SGLD and clipped SGLD.
  • Figure 5: Relative improvement of SGLRW over clipped SGLD at increased learning-rate scale ($\eta_0 = 1.5 \times 10^{-4}$). Heatmaps show percentage differences in negative log-likelihood (left) and expected calibration error (right) across training-set sizes and minibatch sizes.
  • ...and 6 more figures
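
The behaviour described in the captions for Figures 1 and 2 can be probed with a small, self-contained experiment. The sketch below is an assumption-laden illustration rather than a reproduction of the paper's setup: it uses a unimodal quadratic potential $U(\theta)=\theta^2/2$ instead of the multimodal target, draws symmetric $\alpha$-stable noise with the Chambers-Mallows-Stuck method, and assumes a binary lattice update with step $\pm\sqrt{2\delta_t}$. The names `sym_alpha_stable`, `sgld_step`, `lattice_step`, and `run_chain` are hypothetical.

```python
import math
import random


def sym_alpha_stable(alpha, rng):
    """Standard symmetric alpha-stable draw (Chambers-Mallows-Stuck), alpha != 1."""
    u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = rng.expovariate(1.0)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))


def sgld_step(theta, grad, dt, rng):
    """Scalar SGLD step: the noisy gradient enters the move linearly."""
    return theta - dt * grad + math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)


def lattice_step(theta, grad, dt, rng):
    """Scalar binary lattice step (assumed form): move by +/- sqrt(2 * dt)."""
    p_up = min(1.0, max(0.0, 0.5 * (1.0 - grad * math.sqrt(dt / 2.0))))
    h = math.sqrt(2.0 * dt)
    return theta + h if rng.random() < p_up else theta - h


def run_chain(step, n_steps, dt, noise_scale, alpha, seed):
    """Run one chain on U(theta) = theta^2 / 2 and track the largest jump."""
    rng = random.Random(seed)
    theta, max_jump = 0.0, 0.0
    for _ in range(n_steps):
        # Exact gradient of U is theta; corrupt it with heavy-tailed noise.
        noisy_grad = theta + noise_scale * sym_alpha_stable(alpha, rng)
        new_theta = step(theta, noisy_grad, dt, rng)
        max_jump = max(max_jump, abs(new_theta - theta))
        theta = new_theta
    return theta, max_jump


if __name__ == "__main__":
    for name, step in [("SGLD", sgld_step), ("lattice", lattice_step)]:
        _, jump = run_chain(step, 5000, 0.01, 10.0, 1.5, seed=0)
        print(name, "largest single-step jump:", jump)
```

In this sketch the largest lattice jump equals the lattice spacing $\sqrt{2\delta_t}$ by construction, while the largest SGLD jump grows with the most extreme noise draw. Clipping the move probability to $[0, 1]$ is what bounds the influence of an outlier gradient, at the cost of a biased drift whenever clipping is active.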

Theorems & Definitions (10)

  • Definition 2.1: SGLD Update Rule
  • Definition 4.1: SGLRW Update Rule
  • Theorem 4.2
  • Lemma 4.3
  • proof
  • Remark 1.2
  • Lemma 1.3: Refined weak expansion
  • proof
  • Theorem 1.4
  • proof