Partially Stochastic Infinitely Deep Bayesian Neural Networks

Sergio Calvo-Ordonez; Matthieu Meunier; Francesco Piatti; Yuantao Shi

TL;DR

The paper tackles scalable uncertainty quantification for infinitely deep Bayesian neural networks by introducing Partially Stochastic Infinitely Deep BNNs (PSDE-BNNs). It combines Neural ODEs with Neural SDEs and introduces vertical and horizontal weight-separation schemes to inject partial stochasticity, paired with variational training and Ornstein-Uhlenbeck (OU) priors to control complexity. The authors prove expressivity guarantees, showing that PSDE-BNNs are Universal Conditional Distribution Approximators (UCDAs) under suitable conditions, and report empirical gains in accuracy, calibration, and uncertainty quantification, with substantial efficiency improvements over fully stochastic models. The approach makes uncertainty-aware inference in infinite-depth models more practical and leaves room for future work on variance reduction and broader datasets.

Abstract

In this paper, we present Partially Stochastic Infinitely Deep Bayesian Neural Networks, a novel family of architectures that integrates partial stochasticity into the framework of infinitely deep neural networks. Our new class of architectures is designed to improve the computational efficiency of existing architectures at training and inference time. To do this, we leverage the advantages of partial stochasticity in the infinite-depth limit, which include the benefits of full stochasticity (e.g., robustness, uncertainty quantification, and memory efficiency) whilst mitigating its limitations around computational complexity. We present a variety of architectural configurations, offering flexibility in network design, including different methods for weight partition. We also provide mathematical guarantees on the expressivity of our models by establishing that our network family qualifies as Universal Conditional Distribution Approximators. Lastly, empirical evaluations across multiple tasks show that our proposed architectures achieve better downstream task performance and uncertainty quantification than their counterparts while being significantly more efficient. The code can be found at https://github.com/Sergio20f/part_stoch_inf_deep

Paper Structure (27 sections, 7 theorems, 38 equations, 6 figures, 5 tables)

Introduction
Background and Preliminaries
Neural Ordinary Differential Equations
Partial Stochasticity in Bayesian Neural Networks
SDEs as Approximate Posteriors
Method
Vertical Separation of the Weights
Horizontal Separation of the Weights
Training the Network
Expressivity Guarantees
Constrained Infinitely Deep Bayesian Neural Networks are not Universal Approximators
Partially Stochastic Infinitely Deep Bayesian Neural Networks are UCDAs
Experiments
Experimental Setup
Image Classification
...and 12 more sections

Key Result

Lemma 1

Assume the prior and posterior of $w_t$ are given by SDEs with diffusion coefficients $\sigma_p$ and $\sigma_q$, respectively, and that $\sigma_p,\sigma_q$ are continuous. If there exists $t \in (0,1)$ such that $\sigma_{q}(t,w_t)\neq \sigma_{p}(t,w_t)$ with non-zero probability, then $D_{\mathrm{KL}}\left(\mu_{q} \| \mu_{p}\right)=\infty$.
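
A rough numerical intuition for Lemma 1 can be obtained by discretising both processes with Euler-Maruyama and summing the one-step Gaussian KL divergences. The sketch below is an illustration only, not the paper's proof. It assumes scalar processes whose diffusion terms `sigma_q` and `sigma_p` are constants, and it introduces a hypothetical constant `drift_gap` for the difference between the posterior and prior drifts; none of these names come from the paper.

```python
import math

def gaussian_kl(mean_q, var_q, mean_p, var_p):
    """KL(N(mean_q, var_q) || N(mean_p, var_p)) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mean_q - mean_p) ** 2) / var_p
                  - 1.0)

def discretised_path_kl(sigma_q, sigma_p, drift_gap, n_steps):
    """KL between Euler-Maruyama discretisations of two SDEs on [0, 1].

    Assumption: constant scalar coefficients, so every one-step transition
    contributes the same KL and the chain rule gives n_steps * step_kl.
    """
    dt = 1.0 / n_steps
    # One Euler-Maruyama step is Gaussian with variance sigma^2 * dt.
    step_kl = gaussian_kl(drift_gap * dt, sigma_q ** 2 * dt,
                          0.0, sigma_p ** 2 * dt)
    return n_steps * step_kl

# Refine the time grid and compare equal vs. unequal diffusion terms.
for n in (10, 100, 1000, 10000):
    same = discretised_path_kl(1.0, 1.0, 1.0, n)
    diff = discretised_path_kl(1.2, 1.0, 1.0, n)
    print(f"n_steps={n:6d}  equal diffusion: {same:.4f}  unequal diffusion: {diff:.4f}")
```

With equal diffusion terms the one-step KL is proportional to `dt`, so the sum stays bounded as the grid is refined. With unequal diffusion terms each step contributes a positive amount that does not shrink with `dt`, so the sum grows roughly linearly in `n_steps`, which is consistent with the infinite KL divergence stated in the lemma.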

Figures (6)

Figure 1: Vertical Cut: Sample paths of $w_t$ when (left) not fixing $w_{t_2}$ vs. (right) fixing $w_{t_2}$. If we do not fix $w_{t_2}$, $w_t$ will be random in the interval $(0.6, 1)$. Here, $f_q = \cos(20t)$, $g_p = 0$ for $t \notin (0.3, 0.6)$, and $f_q = 0$, $g_p = 1$ for $t \in (0.3, 0.6)$.
Figure 2: Horizontal Cut: Sample paths of $w_S$ (left) and $w_D$ when not separating $f_q$ (middle) vs. separating $f_q$ (right). If we do not separate $f_q$, then $w_{D}$ will also be random. Here, when not separating $f_q$, we take $f_q=[-w_S,t+w_D+w_S]$ and $g_p=[1,0]$; when separating $f_q$, we take $f_q=[-w_S,t+w_D]$ and $g_p=[1,0]$.
Figure 3: Histograms depicting the distribution of predictive uncertainty for SDE-BNN (green) and PSDE-BNN with ODEFirst (blue), SDEFirst (red), and horizontal cuts (orange) across CIFAR-10 test predictions (top row) and OOD sample predictions (bottom row). Entropy was computed for each prediction, with higher values indicating greater uncertainty. Note that the legend includes summary statistics of the histograms, i.e. mean and standard deviation.
Figure 4: ROC curves for OOD detection performance across four models: SDE-BNN, PSDE-BNN ODEFirst with $r_s=0.1$, PSDE-BNN SDEFirst with $r_s=0.1$, and the PSDE-BNN with horizontal cut of the weights ($r_s=0.5$). The curves quantify each model's ability to differentiate between in-distribution (CIFAR-10 test set) and OOD samples, with the AUC metric reflecting the discrimination power. A higher AUC value indicates greater efficacy in distinguishing OOD from in-distribution data.
Figure 5: Evolution of training, test, and validation accuracy over the first 100 training epochs on the CIFAR-10 dataset. The plots illustrate the rate of learning and convergence stability across epochs and highlight how quickly each model achieves competitive performance. We can observe some symptoms of numerical instability, which are inherent to numerical solvers for differential equations. We overcome this issue by storing checkpoints only at the best validation accuracy.
...and 1 more figures
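
The toy dynamics described in the captions of Figures 1 and 2 can be reproduced approximately with a plain Euler-Maruyama loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the step count, the zero initial condition, and the reset value `w_t2_value` are hypothetical choices made here for illustration. Only the drift and diffusion definitions ($f_q$, $g_p$) and the interval $(0.3, 0.6)$ are taken from the captions.

```python
import numpy as np

def spread(x):
    """Range of values across sample paths (0 means all paths coincide)."""
    return float(x.max() - x.min())

def simulate_vertical_cut(n_paths=5, n_steps=1000, t1=0.3, t2=0.6,
                          fix_w_t2=False, w_t2_value=0.0, seed=0):
    """Figure 1 setup: noise is only injected for t in (t1, t2)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    k1, k2 = round(t1 * n_steps), round(t2 * n_steps)
    ts = np.arange(n_steps + 1) * dt
    w = np.zeros((n_paths, n_steps + 1))  # assumed initial condition w_0 = 0
    for k in range(n_steps):
        in_sde_block = k1 <= k < k2
        drift = 0.0 if in_sde_block else np.cos(20.0 * ts[k])  # f_q
        diffusion = 1.0 if in_sde_block else 0.0               # g_p
        noise = rng.normal(size=n_paths) * np.sqrt(dt)
        w[:, k + 1] = w[:, k] + drift * dt + diffusion * noise
        if fix_w_t2 and k + 1 == k2:
            # Overwrite w at t2 with one deterministic value (hypothetical
            # choice), so the ODE segment after t2 carries no randomness.
            w[:, k + 1] = w_t2_value
    return ts, w

def simulate_horizontal_cut(n_paths=5, n_steps=1000,
                            separate_drift=True, seed=0):
    """Figure 2 setup: w_S is stochastic, w_D has zero diffusion."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    w_s = np.zeros((n_paths, n_steps + 1))
    w_d = np.zeros((n_paths, n_steps + 1))
    for k in range(n_steps):
        t = k * dt
        noise = rng.normal(size=n_paths) * np.sqrt(dt)
        # Without separation the drift of w_D depends on w_S: t + w_D + w_S.
        coupling = 0.0 if separate_drift else w_s[:, k]
        w_s[:, k + 1] = w_s[:, k] - w_s[:, k] * dt + noise        # g_p = 1
        w_d[:, k + 1] = w_d[:, k] + (t + w_d[:, k] + coupling) * dt  # g_p = 0
    return w_s, w_d

if __name__ == "__main__":
    for fix in (False, True):
        _, w = simulate_vertical_cut(fix_w_t2=fix)
        print(f"vertical cut, fix_w_t2={fix}: spread at t=1 is {spread(w[:, -1]):.4f}")
    for sep in (False, True):
        _, w_d = simulate_horizontal_cut(separate_drift=sep)
        print(f"horizontal cut, separate_drift={sep}: spread of w_D at t=1 is {spread(w_d[:, -1]):.4f}")
```

In this sketch the sample paths coincide at $t=1$ only when $w_{t_2}$ is fixed (vertical cut) or when $f_q$ is separated (horizontal cut), which mirrors the behaviour the two captions describe.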

Theorems & Definitions (8)

Lemma 1
Theorem 4.1
Theorem 4.2
Lemma 2
Lemma 3
Lemma 4
Theorem 1.1
proof
