On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning
Paul Scemama, Ariel Kapusta
TL;DR
This work probes how split conformal prediction interacts with Bayesian deep learning to affect out-of-distribution coverage in multiclass image classification. It analyzes several Bayesian approximation schemes (SGD/MAP, deep ensembles, MFVI, SGHMC, Laplace) and two conformal methods (threshold and adaptive) on CIFAR10-Corrupted and MedMNIST, linking calibration behavior to changes in OOD coverage. The study finds that conformal prediction can both improve and degrade OOD coverage depending on whether the underlying model is over- or under-confident on the calibration set, with larger prediction sets often correlating with better OOD safety but sometimes at substantial in-distribution cost. The paper emphasizes practical diagnostics and cautious deployment, showing that frequentist guarantees from conformal prediction do not automatically translate to robust OOD performance in real-world distribution shifts, and it highlights the need for careful evaluation of calibration and data shift when combining these uncertainty quantification techniques.
Abstract
Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and on how this combination affects out-of-distribution coverage, particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate the prediction sets produced by combining split conformal methods with neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.
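To make the comparison between conformal sets and predictive credible sets concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard threshold conformal score (one minus the predicted probability of the true class) and defines a credible set as the smallest group of top-ranked classes whose cumulative predicted probability reaches 1 - alpha. The function names and the toy arrays in the usage note are hypothetical.

```python
import numpy as np

def threshold_conformal_quantile(cal_probs, cal_labels, alpha):
    """Calibrate the threshold method on a held-out calibration set.

    Score = 1 - predicted probability of the true class (assumed score).
    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = cal_labels.shape[0]
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return 1.0
    return float(np.sort(scores)[k - 1])

def conformal_sets(test_probs, qhat):
    """Boolean mask: class y is included when p(y | x) >= 1 - qhat."""
    return np.asarray(test_probs, dtype=float) >= 1.0 - qhat

def credible_sets(test_probs, alpha):
    """Boolean mask of the smallest top-ranked set with mass >= 1 - alpha."""
    test_probs = np.asarray(test_probs, dtype=float)
    order = np.argsort(-test_probs, axis=1)           # most probable first
    sorted_probs = np.take_along_axis(test_probs, order, axis=1)
    mass_before = np.cumsum(sorted_probs, axis=1) - sorted_probs
    include_sorted = mass_before < 1.0 - alpha        # stop once mass is reached
    mask = np.zeros_like(include_sorted)
    np.put_along_axis(mask, order, include_sorted, axis=1)
    return mask

def coverage(set_mask, labels):
    """Fraction of examples whose true label lies in its prediction set."""
    labels = np.asarray(labels, dtype=int)
    return float(set_mask[np.arange(labels.shape[0]), labels].mean())
```

Under these assumptions, the mechanism in the abstract can be read off the calibrated threshold. An underconfident model already covers more than 1 - alpha of the calibration labels with its credible sets, so calibration can shrink the sets toward the target level, which may leave less slack under distribution shift. An overconfident model is pushed toward larger sets instead. Calling `coverage(conformal_sets(probs, qhat), labels)` and `coverage(credible_sets(probs, alpha), labels)` on the same shifted data is one way to compare the two.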
