Data Subsampling for Bayesian Neural Networks

Eiji Kawasaki; Markus Holzmann; Lawrence Adu-Gyamfi

TL;DR

This paper tackles the scalability bottleneck of Bayesian posterior sampling for neural networks by introducing Penalty Bayesian Neural Networks (PBNNs), which combine mini-batch likelihood evaluations with a noise penalty in the Metropolis–Hastings acceptance step to achieve unbiased posterior sampling. The method treats the mini-batch loss difference as a noisy estimate and adds a penalty term that accounts for its variance, enabling accurate sampling even with small batch sizes $n$ and multiple batches $M$. The authors develop a full algorithm (PBNN) and discuss extensions to Penalized Langevin Dynamics (PMLD), including MALA and ULA variants, providing a way to calibrate predictive distributions through $n$ and to deploy in federated settings where data are decentralized. Empirical results on synthetic tasks and MNIST show robust predictive performance and improved calibration (reduced overconfidence) as the mini-batch size varies, highlighting the approach’s practical value for scalable Bayesian uncertainty quantification (UQ) in deep learning.

Abstract

Markov Chain Monte Carlo (MCMC) algorithms do not scale well to large datasets, which makes posterior sampling for neural networks difficult. In this paper, we propose Penalty Bayesian Neural Networks (PBNNs), a new algorithm that evaluates the likelihood on subsampled batch data (mini-batches) in a Bayesian inference context to address scalability. PBNN avoids the biases inherent in naive subsampling techniques by incorporating a penalty term as part of a generalization of the Metropolis-Hastings algorithm. We show that it is straightforward to integrate PBNN with existing MCMC frameworks, as the variance of the loss function merely reduces the acceptance probability. By comparing with alternative sampling strategies on both synthetic data and the MNIST dataset, we demonstrate that PBNN achieves good predictive performance even for small mini-batch sizes. We also show that PBNN provides a novel approach for calibrating the predictive distribution by varying the mini-batch size, significantly reducing predictive overconfidence.
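To make the idea of a variance penalty in the acceptance step concrete, here is a minimal sketch on a toy problem (inferring the mean of unit-variance Gaussian data with a flat prior). It is not the authors' implementation. The function names, the toy loss, and the specific acceptance form $\min(1, e^{-\delta - \sigma^2/2})$ are assumptions: this is the usual noise-penalty form for a symmetric proposal and Gaussian noise, and the paper's own equations are not reproduced in this summary. The variance $\sigma^2$ is estimated here from the spread of $M$ mini-batch estimates, which is also an illustrative choice.

```python
import math
import random
import statistics


def point_loss(theta, x):
    """Negative log-likelihood of one point under N(theta, 1), up to a constant."""
    return 0.5 * (x - theta) ** 2


def noisy_loss_difference(theta_new, theta_old, data, n, M, rng):
    """Estimate the full-data loss difference from M mini-batches of size n.

    Returns (delta, sigma2): the mean of the M rescaled batch estimates and
    an estimate of the variance of that mean. Requires M >= 2.
    """
    scale = len(data) / n  # rescale a batch sum to the full data set
    estimates = []
    for _ in range(M):
        batch = rng.sample(data, n)
        diff = sum(point_loss(theta_new, x) - point_loss(theta_old, x) for x in batch)
        estimates.append(scale * diff)
    delta = statistics.fmean(estimates)
    sigma2 = statistics.variance(estimates) / M
    return delta, sigma2


def penalty_accept_prob(delta, sigma2):
    """Acceptance probability with a noise penalty (assumed form, symmetric proposal)."""
    log_ratio = -delta - 0.5 * sigma2
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def sample_chain(data, n, M, steps, step_size, seed=0):
    """Random-walk Metropolis-Hastings using only mini-batch loss differences."""
    rng = random.Random(seed)
    theta = 0.0
    chain = []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, step_size)
        delta, sigma2 = noisy_loss_difference(proposal, theta, data, n, M, rng)
        if rng.random() < penalty_accept_prob(delta, sigma2):
            theta = proposal
        chain.append(theta)
    return chain
```

In this sketch a larger estimated variance can only lower the acceptance probability, which mirrors the abstract's remark that the variance of the loss merely reduces acceptance. Calling `sample_chain(data, n=20, M=5, steps=3000, step_size=0.05)` on a list of a few hundred floats returns one sampled value of the mean per step.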

Paper Structure (22 sections, 33 equations, 4 figures, 1 algorithm)

INTRODUCTION
BACKGROUND
RELATED WORK
Stochastic Gradient Langevin Dynamics
Noisy Posterior Sampling Bias
Adaptive Subsampling Approach
Failures of Data Set Splitting Inference
PENALTY BAYESIAN NEURAL NETWORK
Biased Posterior Sampling Because Of Mini-Batches
Noise Penalty Theory
Expected Posterior Sampling Using Mini-Batches
Large Number Of Mini-Batches $M$ Scenario
PBNN ALGORITHM
Penalty Bayesian Neural Network Posterior Sampling Algorithm
Prediction Calibration Using The Mini-Batch Size $n$
...and 7 more sections

Figures (4)

Figure 1: Illustration of posterior predictive distributions, defined in \ref{equation: predictive distribution}, computed for a linear regression task. The coloured areas correspond to the mean of the distributions $\pm$ one standard deviation. The blue area is computed by naively replacing $\Delta$ by $\delta$ in the MH algorithm from \ref{equation: MH usual acceptance}. The noisy loss difference $\delta$, as defined by \ref{equation: usual delta}, is computed on a single mini-batch of 2 data points. The yellow area shows a similar computation that incorporates the noise penalty term as defined in \ref{equation: random walk penalty acceptance}. It correctly matches the red area corresponding to the analytical Bayesian linear regression with Gaussian prior and known variance \cite{bishop2007}.
Figure 2: Illustration of posterior predictive distributions, defined in \ref{equation: PBNN predictive posterior}. Shaded regions indicate predictive means ± one, two and three standard deviations. We use a homoscedastic Gaussian likelihood (cf. loss defined in \ref{equation: MSE}). The noise in the data is small, so the visible posterior variance is due to epistemic uncertainty. The training data set contains $N=2000$ points, meaning that at each step the PBNN uses only a fraction of the data points. We emphasize that none of these images are expected to match \ref{fig:bnn_regression}, as each targets a different posterior distribution.
Figure 3: Reliability diagrams on test data. The BNN prediction corresponds to \ref{equation: predictive distribution}, whereas the PBNN prediction is computed using \ref{equation: PBNN predictive posterior}. The MNIST classifiers obtain similar test accuracy: 93.2% using a PBNN, and 93.6% using a BNN. The architecture of the softmax classifier is a single hidden layer containing 20 neurons.
Figure 4: SGLD results are very sensitive to the learning rate $\eta$. \ref{figure: PBNN reference} is used as a reference as it targets the same posterior as \ref{fig:sgld_1e-7.png} and \ref{fig:sgld_1e-8.png}. We note that, as expected, Safe Bayes approaches obtain a qualitatively different result from PBNN for the same mini-batch size.
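
For readers unfamiliar with the reliability diagrams mentioned in Figure 3, the following is a minimal sketch of how the points of such a diagram are typically computed from a classifier's top-class confidences. The function name, the equal-width binning, and the number of bins are assumptions for illustration; the summary does not state how the authors binned their predictions.

```python
def reliability_bins(confidences, correct, num_bins=10):
    """Group predictions into equal-width confidence bins.

    confidences: top-class probabilities in [0, 1].
    correct: booleans, True where the predicted class was right.
    Returns (mean_confidence, accuracy, count) for each non-empty bin,
    ordered from the lowest to the highest confidence bin.
    """
    # Each accumulator holds [sum of confidences, number correct, count].
    acc = [[0.0, 0, 0] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        acc[idx][0] += conf
        acc[idx][1] += int(ok)
        acc[idx][2] += 1
    return [(s / c, k / c, c) for s, k, c in acc if c > 0]
```

A perfectly calibrated classifier has accuracy equal to mean confidence in every bin. Bins where mean confidence exceeds accuracy indicate the overconfidence that the abstract reports being reduced by varying the mini-batch size.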
