Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation

Zier Mensch; Lars Holdijk; Samuel Duffield; Maxwell Aifer; Patrick J. Coles; Max Welling; Miranda C. N. Cheng

TL;DR

Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.

Abstract

Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic Gradient Langevin Dynamics (SGLD), SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance; this yields greater robustness to minibatch size while retaining asymptotic correctness. As a point of comparison, we also analyze a natural analogue of SGLD that uses gradient clipping. Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.
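The contrast drawn in the abstract can be illustrated with a minimal sketch. It assumes overdamped Langevin dynamics $d\theta = -\nabla U\,dt + \sqrt{2}\,dW$ and a binary lattice step in which each coordinate moves by exactly $\pm\sqrt{2\delta_t}$, with the (possibly noisy) gradient entering only through the jump probabilities. The function names, the norm-based clipping rule, the probability clipping, and the Cauchy noise used as a heavy-tailed stand-in are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def sgld_step(theta, grad, dt, rng):
    """One SGLD step: Euler-Maruyama update with a (possibly noisy) gradient of U."""
    return theta - dt * grad + np.sqrt(2.0 * dt) * rng.standard_normal(theta.shape)


def clipped_sgld_step(theta, grad, dt, rng, max_norm=1.0):
    """SGLD with the gradient rescaled to norm <= max_norm (assumed clipping rule)."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return sgld_step(theta, scale * grad, dt, rng)


def lattice_step(theta, grad, dt, rng):
    """Binary lattice step (assumed form): each coordinate moves by +/- sqrt(2 dt)."""
    h = np.sqrt(2.0 * dt)
    # Pick P(+h) so that the mean displacement is -dt * grad,
    # then clip so that it stays a valid probability.
    p_up = np.clip(0.5 * (1.0 - grad * np.sqrt(dt / 2.0)), 0.0, 1.0)
    signs = np.where(rng.random(theta.shape) < p_up, 1.0, -1.0)
    return theta + h * signs


def demo(n_steps=2000, dt=1e-2, noise_scale=5.0, seed=0):
    """Standard normal target, U = theta^2 / 2, with Cauchy-corrupted gradients."""
    rng = np.random.default_rng(seed)
    th_sgld, th_lat = np.zeros(1), np.zeros(1)
    max_sgld = max_lat = 0.0
    for _ in range(n_steps):
        zeta = noise_scale * rng.standard_cauchy(1)  # heavy-tailed gradient noise
        th_sgld = sgld_step(th_sgld, th_sgld + zeta, dt, rng)
        th_lat = lattice_step(th_lat, th_lat + zeta, dt, rng)
        max_sgld = max(max_sgld, float(np.abs(th_sgld[0])))
        max_lat = max(max_lat, float(np.abs(th_lat[0])))
    return max_sgld, max_lat


if __name__ == "__main__":
    print("largest |theta| visited (SGLD, lattice):", demo())
```

In this sketch a single extreme gradient sample can move the SGLD iterate arbitrarily far, whereas the lattice iterate moves by at most $\sqrt{2\delta_t}$ per coordinate per step regardless of the gradient's magnitude.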

Paper Structure (45 sections, 4 theorems, 78 equations, 11 figures, 2 tables)

Introduction
Background
Bayesian Machine Learning
Bayesian Posterior Sampling
Stochastic Gradient Methods
Stochastic Gradient Langevin Dynamics
Related Work
Batch-Size Sensitivity
Large-Scale Bayesian Inference
Stochastic Gradient Lattice Random Walk
Lattice Random Walk
Stochastic Gradient Lattice Random Walk
Heavy-Tailed Noise
Mean Squared Error Analysis
Validation and practical considerations
...and 30 more sections

Key Result

Theorem 4.2

Under Assumption ass:lyap, the MSE is bounded by three contributions, for some constant $C$ that depends on the target distribution. The covariance error term for SGLRW is never larger than that of SGLD, while the other two contributions are the same for both methods. Moreover, the SGLRW covariance error term is strictly smaller whenever $2\partial_iU\zeta_i + \zeta_i^2$ is non-vanishing for some direction $i$.
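One way to see where an expression of the form $2\partial_iU\zeta_i + \zeta_i^2$ can arise is to compare per-coordinate second moments of a single step. The following is a sketch under assumptions that are not stated in this excerpt: an SGLD step $\Delta_i = -\delta_t(\partial_iU + \zeta_i) + \sqrt{2\delta_t}\,\xi_i$ with $\xi_i \sim \mathcal{N}(0,1)$ independent of the gradient noise $\zeta_i$, and a binary lattice step $\Delta_i \in \{\pm\sqrt{2\delta_t}\}$.

$$
\mathbb{E}\!\left[\Delta_i^2 \mid \zeta\right]_{\text{SGLD}}
= \delta_t^2\,(\partial_iU + \zeta_i)^2 + 2\delta_t
= \underbrace{\delta_t^2\,(\partial_iU)^2 + 2\delta_t}_{\text{exact-gradient step}}
+ \delta_t^2\left(2\partial_iU\zeta_i + \zeta_i^2\right),
\qquad
\Delta_i^2\,\Big|_{\text{lattice}} = 2\delta_t .
$$

Under these assumptions, gradient noise changes the diagonal second moment of the SGLD step by $\delta_t^2(2\partial_iU\zeta_i + \zeta_i^2)$, while the squared size of the lattice step does not depend on the gradient at all. This is only a heuristic reading of the condition in the theorem, not a restatement of its proof.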

Figures (11)

Figure 1: Comparison of the SGLD (left) and SGLRW (right) discretisations of Langevin dynamics. The lattice-based discretisation suppresses the large parameter jumps caused by minibatch noise, resulting in more stable sampling.
Figure 2: Multimodal univariate target with exact gradient corrupted by synthetic $\alpha$-stable noise ($\alpha = 1.5$) of increasing scale. As the noise scale increases, SGLD quickly fails while SGLRW remains stable.
Figure 3: Mean-squared error (MSE) of the posterior covariance as a function of the step size $\delta_t$, shown for different batch sizes for $50$-dimensional Bayesian linear regression.
Figure 4: Covariance difference matrices $\Sigma_{\text{est}}-\Sigma_{\text{true}}$ for Bayesian linear regression at step size $\delta_t=10^{-3}$, shown across increasing minibatch sizes $B$. Top: Clipped SGLD. Middle: SGLD. Bottom: SGLRW. Each panel visualizes the deviation of the empirical posterior covariance from the analytic posterior covariance; the Frobenius norm (Frob) reports the total error magnitude. The error in the diagonal terms of the estimated covariance matrices is lower for SGLRW than for SGLD and clipped SGLD.
Figure 5: Relative improvement of SGLRW over clipped SGLD at increased learning-rate scale ($\eta_0 = 1.5 \times 10^{-4}$). Heatmaps show percentage differences in negative log-likelihood (left) and expected calibration error (right) across training-set sizes and minibatch sizes.
...and 6 more figures

Theorems & Definitions (10)

Definition 2.1: SGLD Update Rule
Definition 4.1: SGLRW Update Rule
Theorem 4.2
Lemma 4.3
proof
Remark 1.2
Lemma 1.3: Refined weak expansion
proof
Theorem 1.4
proof
