Sampling Parallelism for Fast and Efficient Bayesian Learning

Asena Karolin Özdemir, Lars H. Heyen, Arvid Weyrauch, Achim Streit, Markus Götz, Charlotte Debus

Abstract

Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply due to their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have limited the accessibility and exploration of Bayesian techniques thus far. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on a few example tasks and architectures, comparing against distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the sample number is scaled proportionally to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups under scaling with constant workload, sampling parallelism has a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thus reduces the number of epochs required for convergence.
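
As a concrete illustration of the strategy described above, the following PyTorch sketch shows one way a single training step with sampling parallelism could be organized. It is our own minimal example, not the authors' implementation: the function name `sampling_parallel_step`, the assumption that each forward pass of the Bayesian model draws a fresh weight sample (e.g. via the reparameterization trick inside its layers), and the omission of any KL/prior term are simplifications for illustration.

```python
import torch
import torch.distributed as dist


def sampling_parallel_step(model, batch, targets, criterion, num_samples, optimizer):
    """One training step with sampling parallelism (illustrative sketch only).

    Every rank receives the same batch but evaluates a disjoint subset of the
    `num_samples` Monte Carlo weight samples; gradients are then averaged
    across ranks so all replicas apply the identical parameter update.
    """
    world_size = dist.get_world_size()
    samples_per_rank = num_samples // world_size  # assumes divisibility

    optimizer.zero_grad()
    local_loss = 0.0
    for _ in range(samples_per_rank):
        # Each forward pass is assumed to draw a fresh weight sample from the
        # variational posterior. Because every rank has its own RNG state, any
        # stochastic data augmentation applied locally also differs per rank,
        # which is the augmentation-diversity effect noted in the abstract.
        preds = model(batch)
        loss = criterion(preds, targets) / num_samples
        loss.backward()  # gradients accumulate over the local samples
        local_loss += loss.item()

    # Sum gradients over all ranks; together with the 1/num_samples scaling
    # above this yields the mean gradient over all Monte Carlo samples.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)

    optimizer.step()
    return local_loss  # local portion of the loss, for logging only
```

In the hybrid scheme mentioned in the abstract, each rank would presumably hold both a batch shard and a sample subset, with the gradient reduction running over process groups that span the data and sample dimensions.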

Paper Structure

This paper contains 22 sections, 2 equations, 10 figures, 1 table, 2 algorithms.

Figures (10)

  • Figure 1: BNN Training with Parallel Sampling
  • Figure 2: Distributing Training with Hybrid Parallelization
  • Figure 3: Simplified illustration of the SWIN transformer architecture used for the weather forecasting task; patch embedding and recovery are performed via a 2D convolution or transpose convolution layer with kernel size = stride = $(2, 2)$ over the latitude and longitude dimensions
  • Figure 4: Speed-up (fixed-sample scaling, top) and efficiency (proportional-sample scaling, bottom) of training the Bayesian ViT on CIFAR10. For fixed-sample scaling, we compare sampling parallelism, data parallelism, and a hybrid of both, using 16 random samples and either a fixed global batch size of 256 or a global batch size of 256 times the number of GPUs. For proportional-sample scaling we use a fixed global batch size of 1024 and scale the number of random samples from 1 to 16.
  • Figure 5: Accuracy of the Bayesian ViT on CIFAR10 with two different global batch sizes and 16 random samples, using sampling parallelism, DDP, and hybrid parallelism, over the course of training, for different numbers of GPUs.
  • ...and 5 more figures