Central Limit Theorem for Bayesian Neural Network trained with Variational Inference
Arnaud Descours, Tom Huix, Arnaud Guillin, Manon Michel, Éric Moulines, Boris Nectoux
TL;DR
This work derives Central Limit Theorems for a two-layer Bayesian neural network trained by variational inference in the infinite-width regime, covering three SGD schemes: idealized SGD with exact Gaussian integrals, Bayes-by-Backprop (BbB) SGD with Monte Carlo estimates, and Minimal VI (MiVI) SGD. It proves that the centered empirical measure fluctuations converge to a Gaussian process driven by an SPDE, with the limiting covariance differing between MiVI and the BbB/Idealized schemes. The LLN groundwork from prior mean-field analyses is extended to obtain a full fluctuation theory, and numerical experiments show MiVI achieves substantial computational gains despite larger finite-width variances. These results provide a rigorous, trajectorial understanding of VI-trained BNN behavior and guide practical algorithm choice by balancing variance against computational cost.
Abstract
In this paper, we rigorously derive Central Limit Theorems (CLT) for Bayesian two-layer neural networks in the infinite-width limit, trained by variational inference on a regression task. The networks are trained via different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop, and (iii) a computationally cheaper algorithm named Minimal VI. The latter was recently introduced by leveraging the information obtained at the level of the mean-field limit. Laws of large numbers have already been rigorously proven for the three schemes, which admit the same asymptotic limit. By deriving CLTs, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior, which differs from that of Minimal VI. Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of larger variances, thanks to its substantial gain in computational complexity.
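To make the reparametrization trick and the Monte Carlo estimate behind the Bayes-by-Backprop scheme concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a scalar weight per neuron, a tanh activation, a softplus parametrization of the standard deviation, and a 1/N averaging over neurons; the names `softplus`, `sample_weight`, and `network_output` are hypothetical and chosen for illustration only.

```python
import math
import random


def softplus(r):
    """Map an unconstrained parameter to a positive standard deviation (assumed parametrization)."""
    return math.log1p(math.exp(r))


def sample_weight(mean, rho, rng):
    """Reparametrization trick: w = mean + softplus(rho) * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mean + softplus(rho) * eps


def network_output(x, params, rng, n_samples=1):
    """Monte Carlo estimate of the output of a toy two-layer network of width N.

    params is a list of (mean, rho) pairs, one per neuron. Each neuron's
    Gaussian expectation E[tanh(w * x)] is replaced by an average over
    n_samples reparametrized draws, and the N neurons are averaged (1/N scaling).
    """
    n = len(params)
    total = 0.0
    for mean, rho in params:
        acc = 0.0
        for _ in range(n_samples):
            w = sample_weight(mean, rho, rng)
            acc += math.tanh(w * x)  # toy activation, an assumption of this sketch
        total += acc / n_samples
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    params = [(0.5, -1.0)] * 100  # 100 identical neurons, for illustration
    print(network_output(2.0, params, rng, n_samples=10))
```

In this sketch, the idealized scheme would correspond to replacing the inner sample average by the exact Gaussian integral, while a larger `n_samples` trades extra computation for lower Monte Carlo noise.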
