Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

Emanuel Sommer; Kangning Diao; Jakob Robnik; Uros Seljak; David Rügamer

TL;DR

The paper tackles scalability in Bayesian inference for high-dimensional models by adapting microcanonical Langevin dynamics to stochastic gradients. It shows that naive mini-batch implementations incur a bias when the gradient noise is anisotropic, and it proposes a principled gradient-noise preconditioning scheme (pSMILE) together with an energy-variance-based adaptive tuner that automates step size selection and stabilizes sampling. The resulting SMILE/pSMILE samplers close the gap to, or match the performance of, full-batch MCLMC on challenging Bayesian neural network tasks and remain robust across architectures such as ResNet-18 and Vision Transformers, with strong uncertainty quantification. The work thereby enables scalable Bayesian inference for large models, with automated tuning and direct applicability to Bayesian deep ensembles.

Abstract

Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

Paper Structure (55 sections, 5 theorems, 35 equations, 9 figures, 13 tables, 1 algorithm)

Introduction
Background & Related Work
Monte Carlo Sampling
Full-batch Sampling
MCLMC
Mini-batch Sampling
Improving Sampling for Neural Networks
Stochastic Microcanonical Langevin Dynamics
Sampling Without Explicit Noise Injection
The Pitfall of Anisotropic Stochastic Gradient Noise
Noise Preconditioning
Analytical Benchmarks
A Naive SMILE in Practice
The Benefit of Gradient Noise Preconditioning
Closing the Gap to Full-batch MCMC
...and 40 more sections

Key Result

Proposition 1

If the mini-batching noise can, in continuous time, be modeled as an isotropic Wiener process, the theoretical properties of MCLMC (stationarity and geometric ergodicity) carry over to the continuous-time limit of a stochastic MCLMC sampler.
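To make the isotropy condition in Proposition 1 concrete, the following is a minimal NumPy sketch, not the paper's pSMILE scheme. It assumes a toy linear regression with features on very different scales, estimates the covariance of single-example gradient noise, and measures anisotropy as the ratio of the largest to the smallest eigenvalue (1.0 would be perfectly isotropic). The diagonal rescaling used as a "preconditioner" here, and all function and variable names, are illustrative assumptions.

```python
import numpy as np


def per_example_grads(theta, X, y):
    """Gradient of 0.5 * (x @ theta - y)**2 for each data point, shape (n, d)."""
    residual = X @ theta - y
    return residual[:, None] * X


def anisotropy(cov):
    """Largest over smallest eigenvalue of a covariance; 1.0 means isotropic."""
    eig = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return eig[-1] / eig[0]


rng = np.random.default_rng(0)
n, d = 20_000, 4
scales = np.array([0.1, 1.0, 3.0, 10.0])  # assumed, deliberately mismatched
X = rng.normal(size=(n, d)) * scales
theta = np.ones(d)
y = X @ theta + rng.normal(size=n)

# Covariance of the gradient noise for a batch of size 1; a batch of size B
# drawn with replacement scales this by 1 / B without changing its shape.
G = per_example_grads(theta, X, y)
noise_cov = np.cov(G, rowvar=False)

# Illustrative diagonal preconditioner: rescale each coordinate by the
# inverse standard deviation of its gradient noise.
inv_std = 1.0 / np.sqrt(np.diag(noise_cov))
precond_cov = noise_cov * np.outer(inv_std, inv_std)

raw_ratio = anisotropy(noise_cov)
precond_ratio = anisotropy(precond_cov)
print(f"anisotropy before rescaling: {raw_ratio:.1f}")
print(f"anisotropy after rescaling:  {precond_ratio:.3f}")
```

In this toy setting the noise is anisotropic only because of the feature scales, so a diagonal rescaling brings it close to the isotropic case the proposition assumes. Gradient noise with strong cross-coordinate correlations would not be whitened by a diagonal rescaling alone.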

Figures (9)

Figure 1: Differences between the Bayesian deep ensemble (BDE) performance of naive (orange) and tuned (blue) SMILE variants and a deep ensemble (DE) baseline for a ResNet-7 (428k parameters) on the CIFAR10 dataset. The x-axis is truncated at -0.01 for readability. For all samplers, we report the best performance of an ensemble of 8 chains over a grid of explored step sizes. Standard deviations over replications are comparable to those of the larger-scale setting reported in \ref{tab:resnet18multi_full}. A more detailed plot covering different step and batch sizes is given in \ref{fig:resnet9_ablation}.
Figure 2: Robustness assessment: Perplexity improvement (smaller is better, std. dev. as shaded area) of MCMC sampling over the optimized warmstart across samplers and step sizes for the nanoGPT model with 10.8M parameters on modern-shakespeare.
Figure 3: Posterior samples for the first two dimensions of a 10-dimensional Gaussian Mixture Model. Left: Ground truth density. Middle Left: Single-chain cSGLD samples. Middle Right: Single-chain pSMILE-naive samples. Right: pSMILE-naive samples using 10 independent chains. Both methods were tuned to have a matching average update norm per step. pSMILE-naive successfully traverses modes even in the single-chain setting.
Figure 4: Performance of the SMILE-naive algorithm across various batch and step sizes, compared with several baselines, on a distributional regression task for the bikesharing dataset. The SGHMC step size is tuned, and its best performance is displayed.
Figure 5: Performance of a Bayesian Deep Ensemble of LeNets (62k parameters) on the Fashion-MNIST dataset relative to a Deep Ensemble baseline, using different sampling routines. The shaded areas represent the minimal and maximal performance across three replications for the respective method. For both SGHMC and SMILE, we performed a grid search over step sizes, and for both methods 0.001 performed best.
...and 4 more figures

Theorems & Definitions (6)

Proposition 1: informal
Theorem 3.1: informal
Lemma 1
proof
Proposition 1: Stationarity under Isotropic Noise
Theorem 3.1
