Credal Bayesian Deep Learning

Michele Caprio; Souradeep Dutta; Kuk Jin Jang; Vivian Lin; Radoslav Ivanov; Oleg Sokolsky; Insup Lee

TL;DR

Credal Bayesian Deep Learning (CBDL) makes it possible to train an (uncountably) infinite ensemble of BNNs using only finitely many elements, and it quantifies and disentangles different types of (predictive) uncertainty better than single BNNs and ensembles of BNNs.

Abstract

Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian Neural Networks (BNNs) allow the uncertainty in predictions to be assessed, they cannot properly distinguish between different sources of predictive uncertainty. We present Credal Bayesian Deep Learning (CBDL). Heuristically, CBDL allows training an (uncountably) infinite ensemble of BNNs using only finitely many elements. This is possible thanks to prior and likelihood finitely generated credal sets (FGCSs), a concept from the imprecise probability literature. Intuitively, convex combinations of a finite collection of prior-likelihood pairs can represent infinitely many such pairs. After training, CBDL outputs a set of posteriors on the parameters of the neural network. At inference time, this posterior set is used to derive a set of predictive distributions, which is in turn used to distinguish between (predictive) aleatoric and epistemic uncertainties and to quantify them. The predictive set also produces either (i) a collection of outputs enjoying desirable probabilistic guarantees, or (ii) the single output that is deemed the best, that is, the one having the highest predictive lower probability, another imprecise-probabilistic concept. CBDL is more robust than single BNNs to prior and likelihood misspecification, and to distribution shift. We show that CBDL is better at quantifying and disentangling different types of (predictive) uncertainties than single BNNs and ensembles of BNNs. In addition, we apply CBDL to two case studies to demonstrate its capabilities on downstream tasks: motion prediction in autonomous driving scenarios, and modeling blood glucose and insulin dynamics for artificial pancreas control. In both, CBDL performs better than an ensemble-of-BNNs baseline.

Paper Structure (50 sections, 9 theorems, 52 equations, 77 figures, 15 tables, 1 algorithm)

Introduction
Background and Preliminaries
Bayesian Neural Networks
Imprecise Probabilities
Quantifying and Disentangling Aleatoric and Epistemic Uncertainties
Our Procedure and Its Properties
CBDL algorithm
Theoretical Properties of CBDL
Experiments
(Predictive) Uncertainty Quantification
In-distribution Evaluation
Out-of-distribution Evaluation
Downstream Tasks Performance
Motion Prediction for Autonomous Racing
Problem.
...and 35 more sections

Key Result

Proposition 3

$\overline{P}$ is the upper probability for $\Pi$ if and only if it is also the upper probability for $\Pi^\prime$. That is, $\overline{P}(A)=\sup_{P\in\Pi}P(A)=\sup_{P^\prime\in\Pi^\prime}P^\prime(A)$, for all $A\subset\Omega$. The same holds for the lower probability.
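
The statement above can be checked numerically on a small finite example. The following is a minimal sketch, assuming a 3-class $\Omega$ and five hypothetical extreme points (the values in `EXTREMES`, and the helper names `upper_lower`, `sample_convex_hull`, and `check_all_events`, are illustrative and not taken from the paper). It compares the upper and lower probabilities over $\Pi$ with those over random convex combinations drawn from $\Pi^\prime=\text{Conv}(\Pi)$.

```python
import itertools
import numpy as np

# Hypothetical extreme points P_1..P_5 on the 3-class simplex (illustrative values only).
EXTREMES = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
    [0.4, 0.5, 0.1],
])


def upper_lower(dists, event):
    """Upper and lower probability of `event` (a tuple of class indices)
    over a finite collection of probability vectors (rows of `dists`)."""
    masses = dists[:, list(event)].sum(axis=1)
    return masses.max(), masses.min()


def sample_convex_hull(extremes, n, rng):
    """Draw n random convex combinations of the extreme points."""
    weights = rng.dirichlet(np.ones(len(extremes)), size=n)
    return weights @ extremes


def check_all_events(extremes, n_samples=10_000, seed=0):
    """For every non-empty event A, return
    (upper over Pi, lower over Pi, upper over sampled hull, lower over sampled hull)."""
    rng = np.random.default_rng(seed)
    hull = sample_convex_hull(extremes, n_samples, rng)
    k = extremes.shape[1]
    results = {}
    for r in range(1, k + 1):
        for event in itertools.combinations(range(k), r):
            up, lo = upper_lower(extremes, event)
            hull_up, hull_lo = upper_lower(hull, event)
            results[event] = (up, lo, hull_up, hull_lo)
    return results


if __name__ == "__main__":
    for event, (up, lo, hull_up, hull_lo) in check_all_events(EXTREMES).items():
        # Sampled hull values never leave the interval [lo, up] set by the extremes.
        print(event, round(lo, 3), round(up, 3), round(hull_lo, 3), round(hull_up, 3))
```

Because $Q(A)$ is linear in the mixing weights, no sampled element of the hull can exceed the maximum, or fall below the minimum, attained on the extreme points. The random sampling only illustrates this; it is not a proof.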

Figures (77)

Figure 1: Suppose we are in a $3$-class classification setting, so $\Omega=\{\omega_1,\omega_2,\omega_3\}$. Then, any probability measure $P$ on $\Omega$ can be seen as a probability vector. For example, suppose $P(\{\omega_1\})=0.6$, $P(\{\omega_2\})=0.3$, and $P(\{\omega_3\})=0.1$. We have that $P\equiv (0.6,0.3,0.1)^\top$. Since its elements are positive and sum up to $1$, probability vector $P$ belongs to the unit simplex, the purple triangle in the figure. Then, we can specify $\Pi=\{P_1,\ldots,P_5\}$, and obtain as a consequence that $\Pi^\prime=\text{Conv}(\Pi)$ is the orange pentagon. It is a convex polygon with finitely many extreme elements, and it is the geometric representation of a finitely generated credal set.
Figure 2: In this figure, a replica of flint, $\Pi=\{P_1,P_2\}$, where $P_1$ and $P_2$ are two Normal distributions whose probability density functions (pdf's) $p_1$ and $p_2$ are given by the dashed blue and brown curves, respectively. Their convex hull is $\Pi^\prime=\text{Conv}(\Pi)=\{Q : Q=\beta P_1 + (1-\beta)P_2 \text{, for some } \beta \in [0,1]\}$. The pdf $q$ of an element $Q$ of $\Pi^\prime$ is depicted by a solid black curve. In addition, let $A=[-0.8,-0.4]$. Then, $\underline{P}(A)=\int_{-0.8}^{-0.4} p_2(\omega) \text{d}\omega \approx 0$, while $\overline{P}(A)$ is given by the red shaded area under $p_1$, that is, $\overline{P}(A)=\int_{-0.8}^{-0.4} p_1(\omega) \text{d}\omega$.
Figure 3: The $0.25$-HDR from a Normal Mixture density. This picture is a replica of hyndman. The geometric representation of "$75\%$ probability according to ${P}_j$" is the area between the pdf curve $p_j(\omega)$ and the horizontal bar corresponding to ${p}_j^{0.25}$. A higher probability coverage (according to $P_j$) would correspond to a lower constant, so $p_j^{\alpha}<p_j^{0.25}$, for all $\alpha < 0.25$. In the limit, we recover $100\%$ coverage at $p_j^0=0$.
Figure 4: Let $\Delta_\Theta$ denote the space of probability measures on $\Theta$. Suppose that in the analysis at hand we specified three priors and only one likelihood, so $S=1$ and we can drop the $s$ index. Let $\{P_k(\cdot \mid D)\}_{k=1}^3$ be the collection of exact posteriors, so that the black segment represents the exact posterior FGCS. Then, if we project the elements of $\{P_k(\cdot \mid D)\}_{k=1}^3$ onto $\mathbb{S}_1$ via the KL divergence, we obtain the same distribution $\breve{\mathbf{P}}$. This is detrimental to the analysis because such an approximation underestimates the (posterior) epistemic (and possibly also aleatoric) uncertainty faced by the agent. The user could instead specify a different set $\mathbb{S}_2$ of "well-behaved" distributions onto which to project the elements of $\{P_k(\cdot \mid D)\}_{k=1}^3$. In the figure, we see that they are projected onto $\mathbb{S}_2$ via the KL divergence to obtain $\breve{P}_1$, $\breve{P}_2$, and $\breve{P}_3$. The convex hull of the latter, captured by the red shaded triangle, represents the variational approximation of the exact posterior FGCS.
Figure 5: CBDL is more robust to distribution shifts than single BNNs. Here $\mathcal{P}_\text{lik}$ is the convex hull of five plausible likelihoods, and $d$ denotes a generic metric on the space $\Delta_\mathcal{Y}$ of probabilities on $\mathcal{Y}$. We see that $d(\mathcal{P}_\text{lik},L^o)<d(\mathbf{L},L^o)$; if we replaced the metric $d$ with a generic divergence $\text{div}$, the inequality would still hold.
...and 72 more figures
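
The threshold $p_j^{0.25}$ described in the Figure 3 caption can be estimated by simulation. The following is a minimal sketch, assuming a hypothetical two-component Normal mixture (the weights, means, and standard deviations below are illustrative, not those used in the figure). It uses the standard Monte Carlo approach of taking the $\alpha$-quantile of the density evaluated at draws from that same density, then checks the resulting coverage with a Riemann sum. All function names are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical Normal mixture (illustrative parameters, not taken from the figure).
WEIGHTS = np.array([0.7, 0.3])
MEANS = np.array([0.0, 4.0])
STDS = np.array([1.0, 0.5])


def mixture_pdf(x):
    """Density of the Normal mixture evaluated at x (scalar or array)."""
    x = np.asarray(x, dtype=float)[..., None]
    comps = np.exp(-0.5 * ((x - MEANS) / STDS) ** 2) / (STDS * np.sqrt(2 * np.pi))
    return comps @ WEIGHTS


def sample_mixture(n, rng):
    """Draw n samples: pick a component, then sample from that Normal."""
    idx = rng.choice(len(WEIGHTS), size=n, p=WEIGHTS)
    return rng.normal(MEANS[idx], STDS[idx])


def hdr_threshold(alpha, n=200_000, seed=0):
    """Monte Carlo estimate of p^alpha: the alpha-quantile of the density
    evaluated at draws from the density itself."""
    rng = np.random.default_rng(seed)
    draws = sample_mixture(n, rng)
    return float(np.quantile(mixture_pdf(draws), alpha))


def coverage(threshold, lo=-10.0, hi=10.0, m=200_001):
    """Probability mass of {x : pdf(x) >= threshold}, via a Riemann sum on a grid."""
    grid = np.linspace(lo, hi, m)
    dens = mixture_pdf(grid)
    dx = grid[1] - grid[0]
    return float(np.sum(dens[dens >= threshold]) * dx)


if __name__ == "__main__":
    t25 = hdr_threshold(0.25)
    print("threshold p^0.25:", t25)
    print("coverage of the 0.25-HDR:", coverage(t25))  # should be close to 0.75
    # A smaller alpha (higher coverage) gives a lower threshold, as the caption states.
    print("threshold p^0.10:", hdr_threshold(0.10))
```

Under these assumptions, the region where the density exceeds the estimated threshold carries roughly $75\%$ of the probability mass, and lowering $\alpha$ lowers the threshold, in line with $p_j^{\alpha}<p_j^{0.25}$ for $\alpha<0.25$.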

Theorems & Definitions (25)

Remark 1
Remark 2
Proposition 3
Definition 4
Definition 5
Proposition 6
Remark 7
Proposition 8
Theorem 9
Theorem 10
...and 15 more
