Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks
Minyoung Kim
TL;DR
The paper addresses scalable posterior inference for Bayesian neural networks by introducing a proximal stochastic-gradient score-based variational inference method. It replaces reparameterization-heavy ELBO objectives with a proximal score-matching objective that can leverage noisy mini-batch scores and support expressive variational families beyond Gaussians, including normalizing flows. The authors provide informal convergence arguments and demonstrate empirical benefits across toy problems, MNIST, large-scale visual recognition (ResNet and ViT), and time-series forecasting, showing faster convergence and improved uncertainty quantification (e.g., lower $ ext{NLL}$ and $ ext{ECE}$) compared with ADVI, while overcoming scalability and numerical issues that hinder GSM and BaM at size. Overall, the method offers a flexible, scalable approach to Bayesian deep learning with practical impact for uncertainty-aware predictions in vision and sequential domains.
Abstract
Bayesian (deep) neural networks (BNN) are often more attractive than the mainstream point-estimate vanilla deep learning in various aspects including uncertainty quantification, robustness to noise, resistance to overfitting, and more. The variational inference (VI) is one of the most widely adopted approximate inference methods. Whereas the ELBO-based variational free energy method is a dominant choice in the literature, in this paper we introduce a score-based alternative for BNN variational inference. Although there have been quite a few score-based variational inference methods proposed in the community, most are not adequate for large-scale BNNs for various computational and technical reasons. We propose a novel scalable VI method where the learning objective combines the score matching loss and the proximal penalty term in iterations, which helps our method avoid the reparametrized sampling, and allows for noisy unbiased mini-batch scores through stochastic gradients. This in turn makes our method scalable to large-scale neural networks including Vision Transformers, and allows for richer variational density families. On several benchmarks including visual recognition and time-series forecasting with large-scale deep networks, we empirically show the effectiveness of our approach.
