Function-Space MCMC for Bayesian Wide Neural Networks

Lucia Pezzetti, Stefano Favaro, Stefano Peluchetti

TL;DR

The paper tackles uncertainty quantification in Bayesian Neural Networks by examining function-space MCMC sampling on a reparameterized weight posterior that becomes approximately Gaussian as width grows. It proves that the acceptance probabilities of the preconditioned Crank-Nicolson (pCN) and its Langevin variant (pCNL) converge to 1 in the wide-network limit, independent of stepsize, and demonstrates enhanced effective sample size and diagnostics relative to standard LMC. Empirical results on CIFAR-10 show pCN (and to a lesser extent pCNL) scales favorably with width, while LMC deteriorates, making pCN the preferred method for very wide BNNs. A marginal-conditional decomposition further reduces effective sampling dimensionality, and real-world experiments corroborate the theoretical benefits, highlighting practical impact for scalable Bayesian inference in wide neural models.

Abstract

Bayesian Neural Networks represent a fascinating confluence of deep learning and probabilistic reasoning, offering a compelling framework for understanding uncertainty in complex predictive models. In this paper, we investigate the use of the preconditioned Crank-Nicolson algorithm and its Langevin version to sample from a reparametrised posterior distribution of the neural network's weights, as the widths grow larger. In addition to being robust in the infinite-dimensional setting, we prove that the acceptance probabilities of the proposed algorithms approach 1 as the width of the network increases, independently of any stepsize tuning. Moreover, we examine and compare how the mixing speeds of the underdamped Langevin Monte Carlo, the preconditioned Crank-Nicolson and the preconditioned Crank-Nicolson Langevin samplers are influenced by changes in the network width in some real-world cases. Our findings suggest that, in wide Bayesian Neural Networks configurations, the preconditioned Crank-Nicolson algorithm allows for a scalable and more efficient sampling of the reparametrised posterior distribution, as also evidenced by a higher effective sample size and improved diagnostic results compared with the other analysed algorithms.
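For readers unfamiliar with the sampler named in the abstract, the following is a minimal sketch of one textbook pCN step, not the authors' implementation. It assumes the reparametrised prior is a standard Gaussian and that `phi` is the negative log-likelihood; the names `pcn_step`, `run_pcn` and `phi` are hypothetical. The proposal is $w' = \sqrt{1-\beta^2}\,w + \beta\,\xi$ with $\xi \sim N(0, I)$, accepted with probability $\min\{1, \exp(\Phi(w) - \Phi(w'))\}$.

```python
import math
import random


def pcn_step(w, phi, beta, rng):
    """One pCN step for a target proportional to exp(-phi(w)) * N(w; 0, I).

    Assumption: the prior is a standard Gaussian, so the proposal leaves the
    prior invariant and only phi enters the acceptance ratio.
    """
    scale = math.sqrt(1.0 - beta ** 2)
    proposal = [scale * wi + beta * rng.gauss(0.0, 1.0) for wi in w]
    # Log acceptance probability: min(0, phi(w) - phi(w')).
    log_alpha = min(0.0, phi(w) - phi(proposal))
    # 1 - random() lies in (0, 1], which avoids log(0).
    if math.log(1.0 - rng.random()) <= log_alpha:
        return proposal, True
    return w, False


def run_pcn(phi, dim, beta, n_steps, seed=0):
    """Run a pCN chain from a prior draw; return samples and acceptance rate."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    samples, accepted = [], 0
    for _ in range(n_steps):
        w, ok = pcn_step(w, phi, beta, rng)
        accepted += ok
        samples.append(w)
    return samples, accepted / n_steps


if __name__ == "__main__":
    # Hypothetical toy negative log-likelihood, for illustration only.
    toy_phi = lambda w: 0.5 * sum((x - 1.0) ** 2 for x in w)
    _, rate = run_pcn(toy_phi, dim=10, beta=0.2, n_steps=1000)
    print("acceptance rate:", rate)
```

When `phi` is identically zero the target coincides with the prior and every proposal is accepted, whatever the value of `beta`. This is the mechanism behind the robustness of pCN: the acceptance ratio depends only on the change in `phi`, not on the dimension of `w` directly.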

Paper Structure (22 sections, 3 theorems, 52 equations, 6 figures, 2 algorithms)


Key Result

Theorem 2.1

Consider the BNN model with reparametrised weights. Then, for any $\beta \in [0, 1)$, the acceptance probability of the pCN algorithm targeting the reparametrised weight posterior converges to $1$ as the width of the network increases. If $d_{min}$ denotes the smallest among the network's

Figures (6)

  • Figure 1: Comparison at different stepsizes ($\beta = 0.2, 0.1, 0.01$) of the acceptance probability obtained using: i. the underdamped LMC algorithm (or Metropolis Adjusted Langevin Algorithm: LMC); ii. the pCN algorithm; iii. the pCNL method. The architecture is a fully connected network with one hidden layer, whose width varies among the following values: $\{512,\, 1024,\, 2048,\, 4096,\, 8192\}$. The CIFAR-10 dataset is used, with the sample size fixed at $n=256$. The acceptance rate of pCN increases steadily with the width of the BNN at every stepsize $\beta$, suggesting improved performance in wide BNNs and empirically confirming our theoretical analysis. The pCNL algorithm shows a similar trend in its acceptance rate, outperforming the other samplers. In contrast, the acceptance rate of LMC generally deteriorates as the width of the BNN increases, reflecting the sampler's non-robustness in high-dimensional settings.
  • Figure 2: ESS analysis of the LMC, pCN and pCNL algorithms as a function of the 1-layer FCN's width for stepsizes $\beta = 0.2$ (left), $\beta = 0.1$ (middle) and $\beta = 0.01$ (right). The solid lines represent the average per-step ESS, whereas the shaded areas indicate the variability of the per-step ESS, delineated by its minimum and maximum values. The setting is the same as in Figure 1: the layer width of the BNN varies among the following values: $\{512,\, 1024,\, 2048,\, 4096,\, 8192\}$. The CIFAR-10 dataset is used, with sample size fixed at $n=256$. The poor LMC performance reflects the fact that standard MCMC procedures are ill-posed in high-dimensional settings. In contrast, the pCN and pCNL samplers show steady growth in ESS as the network width increases, indicating that the gains in acceptance rate translate into better efficiency. Finally, the smallest stepsize, $\beta = 0.01$, heavily affects the behavior of both algorithms, introducing high autocorrelation among the samples and degrading their quality.
  • Figure 3: Evolution of the Gelman-Rubin statistic, evaluated using three independent chains, of the LMC, pCN and pCNL samplers as a function of the number of steps for the stepsize $\beta = 0.1$. The solid lines represent the average Gelman-Rubin statistic, whereas the shaded areas indicate its standard deviation. Again, the CIFAR-10 dataset, with sample size fixed at $n=256$, is used. Since the metric should be close to 1 for all chains to be considered converged, we trace a horizontal line at $\hat{R} = 1.2$, indicating the standard empirical threshold for determining convergence. For all samplers, the chosen burn-in of 20,000 steps appears to be sufficient to ensure that the chains have reached the target distribution. The metric for the pCN sampler improves as the width increases, whereas the LMC method shows the opposite trend. Finally, the pCNL approach exhibits consistently good results across all widths.
  • Figure 4: Trace plots of the first two principal components for the LMC sampler with different stepsizes. From top to bottom: $\beta = 0.01$, $\beta = 0.1$, and $\beta = 0.2$. A well-mixed chain should exhibit stable oscillations around a mean value, without strong trends or correlations.
  • Figure 5: Trace plots of the first two principal components for the pCN sampler with different stepsizes. From top to bottom: $\beta = 0.01$, $\beta = 0.1$, and $\beta = 0.2$. The stability of the trace plots improves for larger network widths, demonstrating the suitability of pCN in wide BNNs.
  • ...and 1 more figures
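
The Gelman-Rubin statistic referenced in Figure 3 can be computed as in the sketch below. This is the classic (non-split) $\hat{R}$ for a scalar quantity and is offered as an illustration only; the paper's exact variant is not stated in this summary, and the function name `gelman_rubin` is hypothetical. It compares the average within-chain variance $W$ with the variance of the chain means $B/n$, and returns $\sqrt{\big(\tfrac{n-1}{n} W + \tfrac{B}{n}\big) / W}$.

```python
import math
import random
import statistics


def gelman_rubin(chains):
    """Classic Gelman-Rubin R-hat for m chains of equal length n (scalars).

    Assumption: the non-split version; other variants (split, rank-normalised)
    give slightly different values.
    """
    n = len(chains[0])
    if len(chains) < 2 or any(len(c) != n for c in chains) or n < 2:
        raise ValueError("need at least two chains of equal length >= 2")
    # W: average of the within-chain sample variances.
    within = statistics.mean(statistics.variance(c) for c in chains)
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    var_hat = (n - 1) / n * within + between_over_n
    return math.sqrt(var_hat / within)


if __name__ == "__main__":
    rng = random.Random(0)
    # Three chains drawn from the same distribution.
    mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
    # Three chains stuck around different means.
    stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 5.0, 10.0)]
    print("R-hat, same target:", gelman_rubin(mixed))
    print("R-hat, shifted means:", gelman_rubin(stuck))
```

Chains that sample the same distribution give a value close to 1, while chains centred on different values give a value well above the $\hat{R} = 1.2$ threshold drawn in Figure 3.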

Theorems & Definitions (4)

  • Theorem 2.1
  • proof
  • Theorem 2.2
  • Theorem 3.1