Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks

Emanuel Sommer, Jakob Robnik, Giorgi Nozadze, Uros Seljak, David Rügamer

TL;DR

This work tackles the bottleneck of sampling-based Bayesian inference for large, multimodal neural network posteriors by introducing Microcanonical Langevin Ensembles (MILE), which embed an adapted Microcanonical Langevin Monte Carlo (MCLMC) sampler in an ensemble framework whose members are initialized by optimization. By combining three-phase tuning (step size, energy-variance scheduling, and ESS targeting), deep ensemble initialization, and deterministic gradient steps, MILE delivers speedups of up to an order of magnitude over NUTS-based approaches while preserving or improving predictive performance and uncertainty quantification. Benchmarks on UCI datasets, CNNs, and attention-based models show that it scales well and has predictable resource requirements, with clear advantages for higher-dimensional models and larger datasets. The method reduces runtime variability, simplifies parallelization, and provides a practical, auto-tuned option for sampling-based BNN inference. Future work could extend MILE with stochastic-gradient variants and explore alternative priors to broaden its applicability.

Abstract

Despite recent advances, sampling-based inference for Bayesian Neural Networks (BNNs) remains a significant challenge in probabilistic deep learning. While sampling-based approaches do not require a variational distribution assumption, current state-of-the-art samplers still struggle to navigate the complex and highly multimodal posteriors of BNNs. As a consequence, sampling still requires considerably longer inference times than non-Bayesian methods even for small neural networks, despite recent advances in making software implementations more efficient. Besides the difficulty of finding high-probability regions, the time until samplers provide sufficient exploration of these areas remains unpredictable. To tackle these challenges, we introduce an ensembling approach that leverages strategies from optimization and a recently proposed sampler called Microcanonical Langevin Monte Carlo (MCLMC) for efficient, robust and predictable sampling performance. Compared to approaches based on the state-of-the-art No-U-Turn Sampler, our approach delivers substantial speedups up to an order of magnitude, while maintaining or improving predictive performance and uncertainty quantification across diverse tasks and data modalities. The suggested Microcanonical Langevin Ensembles and modifications to MCLMC additionally enhance the method's predictability in resource requirements, facilitating easier parallelization. All in all, the proposed method offers a promising direction for practical, scalable inference for BNNs.

Paper Structure

This paper contains 61 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Flowchart illustrating our proposed procedure for obtaining a Microcanonical Langevin Ensemble (MILE) for BNNs. The process involves three main stages: optimization, MCLMC warmup and tuning, and MCLMC sampling. These steps are parallelized to generate an ensemble of $K$ members. The number of MCLMC steps for each tuning phase and the final sampling phase are annotated, and carryovers between stages are highlighted in circles.
  • Figure 2: Average gradient evaluations per chain for 1000 posterior samples for the experiments reported in Table \ref{tab:uci_repl_main}.
  • Figure 3: Upper left: average sampling wallclock times (minutes, y-axis) of BDE (blue) and MILE (orange) for the bikesharing dataset across 4 NN architectures with increasing parameter count (x-axis). Upper right: average sampling wallclock times (hours, y-axis) for the protein dataset across varying training data sizes (x-axis). Dashed lines indicate power-law and quadratic model fits, respectively. In both cases the sampling time ratio between BDE and MILE is around 7 to 9, independent of the number of parameters and observations. This is a result of NUTS always being close to its maximum number of iterations per sample, which we set to the default value of 1024 gradient calls. It therefore uses around $1024 \times (1000 + 100) \approx 11 \times 10^5$ gradient calls, as also displayed in Figure \ref{fig:grad_evals}. MILE, on the other hand, always uses $2 \times 60000 = 12 \times 10^4$ calls, which gives a ratio of $9.2$. The bottom row shows hold-out metric performance across the 4 network architectures. DE performance for the LPPD and RMSE metrics is shown as a grey reference. All charts include standard errors over 3 data splits.
  • Figure 4: Results of the ablation studies on the bikesharing dataset, testing the robustness of the MILE algorithm to its tuning parameters (x-axes, proposed defaults in bold). The average hold-out RMSE and LPPD are reported with their standard errors over 3 data splits. The same metrics are reported for the two major parameters of the sampling kernel, $L$ and the step size, which are otherwise set by the proposed tuning procedure. "#Effective samples to estimate EEVPD" and "Trust in the estimate" are minor parameters of the step size adaptation algorithm that determine the sample weighting during the EEVPD computation.
  • Figure 5: Schematic overview of the sequential attention-based model architecture (ATT) that is applied to the IMDB Dataset.
  • ...and 2 more figures
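
The gradient-call arithmetic in the Figure 3 caption can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping, not the authors' code: the variable names are hypothetical, and reading the "+ 100" term as warmup iterations and the factor 2 as gradient calls per MILE step are assumptions based on the caption alone.

```python
# Gradient-call budgets per chain, using only the numbers quoted in the
# Figure 3 caption. All names here are hypothetical.
NUTS_MAX_GRAD_CALLS = 1024   # cap on gradient calls per NUTS sample (caption)
NUTS_SAMPLES = 1000          # posterior samples per chain (caption)
NUTS_EXTRA = 100             # assumption: the "+ 100" term is warmup iterations

MILE_STEPS = 60_000          # second factor in the caption's "2 x 60000"
MILE_FACTOR = 2              # assumption: gradient calls per MILE step

# Worst case for NUTS: every iteration hits the cap.
nuts_budget = NUTS_MAX_GRAD_CALLS * (NUTS_SAMPLES + NUTS_EXTRA)
# MILE uses a fixed budget, independent of the posterior geometry.
mile_budget = MILE_FACTOR * MILE_STEPS

ratio = nuts_budget / mile_budget

print(f"NUTS budget: {nuts_budget:,} gradient calls")
print(f"MILE budget: {mile_budget:,} gradient calls")
print(f"ratio:       {ratio:.2f}")
```

The unrounded ratio is about 9.4. The caption's value of 9.2 follows from dividing the rounded budgets, $11 \times 10^5 / (12 \times 10^4)$.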

Theorems & Definitions (2)

  • Definition 1: Calibration
  • Definition 2: Calibration error