Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks

Minyoung Kim

TL;DR

The paper addresses scalable posterior inference for Bayesian neural networks by introducing a proximal stochastic-gradient score-based variational inference method. It replaces reparameterization-heavy ELBO objectives with a proximal score-matching objective that can use noisy mini-batch scores and supports expressive variational families beyond Gaussians, including normalizing flows. The authors provide informal convergence arguments and demonstrate empirical benefits on toy problems, MNIST, large-scale visual recognition (ResNet and ViT), and time-series forecasting, showing faster convergence and improved uncertainty quantification (e.g., lower $\text{NLL}$ and $\text{ECE}$) compared with ADVI, while overcoming the scalability and numerical issues that hinder GSM and BaM at scale. Overall, the method offers a flexible, scalable approach to Bayesian deep learning with practical impact for uncertainty-aware predictions in vision and sequential domains.

Abstract

Bayesian (deep) neural networks (BNNs) are often more attractive than mainstream point-estimate deep learning in several respects, including uncertainty quantification, robustness to noise, and resistance to overfitting. Variational inference (VI) is one of the most widely adopted approximate inference methods. Whereas the ELBO-based variational free energy method is the dominant choice in the literature, in this paper we introduce a score-based alternative for BNN variational inference. Although quite a few score-based variational inference methods have been proposed in the community, most are not adequate for large-scale BNNs for various computational and technical reasons. We propose a novel scalable VI method whose learning objective combines the score matching loss and a proximal penalty term across iterations, which lets our method avoid reparameterized sampling and allows for noisy unbiased mini-batch scores through stochastic gradients. This in turn makes our method scalable to large-scale neural networks, including Vision Transformers, and allows for richer variational density families. On several benchmarks, including visual recognition and time-series forecasting with large-scale deep networks, we empirically show the effectiveness of our approach.
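
To make the idea of combining a score matching loss with a proximal penalty concrete, the following is a minimal sketch on a toy problem and not the paper's actual objective or algorithm. It assumes a diagonal Gaussian $q$ written in natural parameters (`a` = precision, `b` = precision times mean), a diagonal Gaussian target whose score is corrupted by zero-mean noise as a stand-in for mini-batch scores, and a squared Euclidean proximal penalty on `(a, b)`. All function names, the step size `eta`, and the closed-form update are illustrative assumptions. Samples are drawn from the current $q$ and treated as constants, so no gradient flows through the sampling step.

```python
import numpy as np

def target_score(theta, mean, var, noise_std, rng):
    """Score of a diagonal Gaussian target plus zero-mean noise
    (a hypothetical stand-in for an unbiased mini-batch score)."""
    exact = -(theta - mean) / var
    return exact + noise_std * rng.standard_normal(theta.shape)

def proximal_score_step(a, b, theta, score, eta):
    """One proximal step, solved in closed form per dimension.

    The score of q is  b - a * theta.  The step minimizes
        0.5 * mean((b - a*theta - score)**2)
        + (0.5 / eta) * ((a - a_old)**2 + (b - b_old)**2)
    which is a quadratic in (a, b), i.e. a 2x2 linear system.
    """
    e_t = theta.mean(axis=0)
    e_tt = (theta ** 2).mean(axis=0)
    e_s = score.mean(axis=0)
    e_ts = (theta * score).mean(axis=0)
    inv_eta = 1.0 / eta
    # Normal equations: [[h11, h12], [h12, h22]] @ [a, b] = [r1, r2]
    h11 = e_tt + inv_eta
    h12 = -e_t
    h22 = 1.0 + inv_eta
    r1 = -e_ts + a * inv_eta
    r2 = e_s + b * inv_eta
    det = h11 * h22 - h12 ** 2
    a_new = (h22 * r1 - h12 * r2) / det
    b_new = (h11 * r2 - h12 * r1) / det
    # Keep the precision positive so q stays a valid Gaussian.
    return np.maximum(a_new, 1e-6), b_new

def fit(target_mean, target_var, steps=300, n_samples=128,
        eta=1.0, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    dim = target_mean.shape[0]
    a, b = np.ones(dim), np.zeros(dim)  # start from a standard normal q
    for _ in range(steps):
        mean, std = b / a, 1.0 / np.sqrt(a)
        # Plain samples from the current q, used as fixed data points.
        theta = mean + std * rng.standard_normal((n_samples, dim))
        score = target_score(theta, target_mean, target_var, noise_std, rng)
        a, b = proximal_score_step(a, b, theta, score, eta)
    return b / a, 1.0 / a  # fitted mean and variance

TARGET_MEAN = np.array([1.0, -2.0])
TARGET_VAR = np.array([0.5, 2.0])
fitted_mean, fitted_var = fit(TARGET_MEAN, TARGET_VAR)
print("fitted mean:", fitted_mean)
print("fitted variance:", fitted_var)
```

In this toy setting the proximal term damps the effect of score noise at each iteration, while the fixed point of the noise-free update is the target's own mean and variance. For a deep network the inner problem would have no closed form and would presumably be handled with stochastic gradient steps instead.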

Paper Structure (25 sections, 1 theorem, 19 equations, 6 figures, 8 tables, 1 algorithm)

Introduction
Problem Setup and Background
Background on ELBO-based VI
Background on Score-based VI
Our Approach
Related Work
Experiments
Toy/Synthetic Experiments
Gaussian Target Cases
Non-Gaussian Target Cases
MNIST Experiment
Large-scale BNNs for Visual Recognition
Time-series Forecasting with BNNs
Normalizing Flow Variational Density
Ablation Study
...and 10 more sections

Key Result

Lemma A.1

Assume that some regularity conditions hold for two distributions $p(\theta)$ and $q(\theta)$ (e.g., smooth distributions and Lipschitz continuous scores). We have $p = q$ if $\nabla_\theta \log p(\theta) = \nabla_\theta \log q(\theta)$ for all $\theta$ almost surely.
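
The lemma follows from a short standard argument, sketched here for orientation. This is not necessarily the proof given in the paper's appendix, and it additionally assumes that both densities are differentiable and strictly positive on all of $\mathbb{R}^d$, and that the scores agree at every $\theta$.

```latex
\begin{align*}
\nabla_\theta \log p(\theta) = \nabla_\theta \log q(\theta) \quad \forall \theta
&\;\Longrightarrow\; \nabla_\theta \big[ \log p(\theta) - \log q(\theta) \big] = 0 \\
&\;\Longrightarrow\; \log p(\theta) - \log q(\theta) = c
  \quad \text{(a constant, since } \mathbb{R}^d \text{ is connected)} \\
&\;\Longrightarrow\; p(\theta) = e^{c}\, q(\theta) \\
&\;\Longrightarrow\; 1 = \int p(\theta)\, d\theta = e^{c} \int q(\theta)\, d\theta = e^{c}
  \;\Longrightarrow\; c = 0, \quad p = q .
\end{align*}
```

The normalization step is what removes the unknown constant: scores are blind to the normalizer, so matching them pins down the density only once both sides are required to integrate to one.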

Figures (6)

Figure 1: (Gaussian target cases) (Top row) Convergence as the number of score calls increases for forward KL (Left) and the error in Gaussian parameters $\lambda = (m,V)$ (Right). (Bottom row) Convergence for initial $q$ with small $\textrm{Cov}(q)$ eigenvalues.
Figure 2: (Mixtures of Gaussians) (a) Case-1: Convergence in the order-2 mixture of Gaussians target case. Ours and ADVI use an order-2 mixture of Gaussians $q$ (Left) and an order-3 mixture of Gaussians $q$ (i.e., model mismatch) (Right). (b) Case-2: When the target $\pi$ has order 5. (c) Case-3: When $\dim(\theta)\!=\!30$.
Figure 3: (MNIST) Impact of training data size on test prediction and uncertainty quantification.
Figure 4: (Large-scale Bayesian deep learning) Convergence in test cross entropy losses.
Figure 5: (PosteriorDB) Convergence with noise-free scores.
...and 1 more figure

Theorems & Definitions (3)

Claim
Lemma A.1: Score matching implies distribution matching
Proof: Proof of Lemma A.1
