
On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning

Paul Scemama, Ariel Kapusta

TL;DR

This work probes how split conformal prediction interacts with Bayesian deep learning to affect out-of-distribution coverage in multiclass image classification. It analyzes several Bayesian approximation schemes (SGD/MAP, deep ensembles, MFVI, SGHMC, Laplace) and two conformal methods (threshold and adaptive) on CIFAR10-Corrupted and MedMNIST, linking calibration behavior to changes in OOD coverage. The study finds that conformal prediction can both improve and degrade OOD coverage depending on whether the underlying model is over- or under-confident on the calibration set, with larger prediction sets often correlating with better OOD safety but sometimes at substantial in-distribution cost. The paper emphasizes practical diagnostics and cautious deployment, showing that frequentist guarantees from conformal prediction do not automatically translate to robust OOD performance in real-world distribution shifts, and it highlights the need for careful evaluation of calibration and data shift when combining these uncertainty quantification techniques.

Abstract

Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and on how this combination affects out-of-distribution coverage, particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate prediction sets that result from combining split conformal methods with neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.
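
To make the over- versus under-confidence argument concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy probabilities, and the specific definitions are assumptions. It uses a common formulation of threshold conformal prediction (score $1 - p(\text{true class})$ with the finite-sample corrected quantile) and of credible sets (top classes until cumulative probability reaches $1 - \alpha$); the paper's exact definitions may differ.

```python
import math

def credible_set(probs, alpha):
    """Top classes, added in descending probability, until mass >= 1 - alpha."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    chosen, total = set(), 0.0
    for k in order:
        chosen.add(k)
        total += probs[k]
        if total >= 1.0 - alpha:
            break
    return chosen

def thr_quantile(cal_probs, cal_labels, alpha):
    """Conformal quantile of the scores 1 - p(true class) on the calibration set."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: every class gets included
    return scores[rank - 1]

def thr_set(probs, qhat):
    """All classes whose score 1 - p does not exceed the calibrated quantile."""
    return {k for k, p in enumerate(probs) if 1.0 - p <= qhat}

alpha = 0.25  # hypothetical error tolerance, matching the value used in Figure 1

# Toy overconfident model: always 0.9 on class 0, but wrong on 3 of 7 points.
over_cal = [[0.9, 0.05, 0.05]] * 4 + [[0.9, 0.06, 0.04]] * 3
over_labels = [0, 0, 0, 0, 1, 1, 1]
q_over = thr_quantile(over_cal, over_labels, alpha)
test_over = [0.9, 0.06, 0.04]

# Toy underconfident model: always correct, but only 0.5 on the true class.
under_cal = [[0.5, 0.3, 0.2]] * 7
under_labels = [0] * 7
q_under = thr_quantile(under_cal, under_labels, alpha)
test_under = [0.5, 0.3, 0.2]

print(sorted(credible_set(test_over, alpha)), sorted(thr_set(test_over, q_over)))
print(sorted(credible_set(test_under, alpha)), sorted(thr_set(test_under, q_under)))
```

In this toy setup, conformalizing the overconfident model enlarges the set (credible set `{0}`, conformal set `{0, 1}`), while conformalizing the underconfident model shrinks it (credible set `{0, 1}`, conformal set `{0}`). A smaller set has fewer chances to contain the true label under distribution shift, which is the mechanism the abstract describes.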

Paper Structure (44 sections, 23 equations, 24 figures, 11 tables)

Figures (24)

  • Figure 1: A conceptual illustration of how conformal prediction can help or harm out-of-distribution coverage at an error tolerance of $0.25$. Left: conformalizing a model that is overly confident on the calibration dataset makes the overall machine learning system less confident, and as a consequence it gains coverage on out-of-distribution examples. Right: the opposite direction, in which conformal prediction reduces coverage on out-of-distribution examples.
  • Figure 2: CIFAR10 accuracy plot. The accuracy plot shows, for each corruption intensity, the average accuracy over all corrupted datasets at that intensity.
  • Figure 3: CIFAR10: The first table displays the credible set coverage on the calibration dataset to indicate over- and under-confident predictive methods. The second table displays the average set sizes on the calibration dataset. For each predictive modeling method, the prediction set method with the highest average set size is shown in bold.
  • Figure 5: CIFAR10 and CIFAR10-Corrupted results at the 0.05 error tolerance. Row 1 and Row 3 illustrate the average coverage and average set size (respectively) of each prediction set method ($thr$, $ap$, simple predictive ($cred$)) for each predictive modeling method. To explicitly indicate the extent to which conformal prediction affects simple predictive sets, Row 2 shows the average coverage difference between conformal prediction methods ($thr$, $ap$) and simple predictive credible sets ($cred$).
  • Figure 6: CIFAR10 and CIFAR10-Corrupted results at the 0.01 error tolerance. Row 1 and Row 3 illustrate the average coverage and average set size (respectively) of each prediction set method ($thr$, $ap$, simple predictive ($cred$)) for each predictive modeling method. To explicitly indicate the extent to which conformal prediction affects simple predictive sets, Row 2 shows the average coverage difference between conformal prediction methods ($thr$, $ap$) and simple predictive credible sets ($cred$).
  • ...and 19 more figures
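
The three quantities named in the captions for Figures 5 and 6 (average coverage, average set size, and the coverage difference between a conformal method and credible sets) can be sketched as follows. The helper names and the four-example dataset are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def average_coverage(pred_sets, labels):
    """Fraction of examples whose true label falls inside its prediction set."""
    return sum(y in s for s, y in zip(pred_sets, labels)) / len(labels)

def average_set_size(pred_sets):
    """Mean number of classes per prediction set."""
    return sum(len(s) for s in pred_sets) / len(pred_sets)

# Hypothetical labels and prediction sets for four test examples.
labels = [0, 1, 2, 1]
cred_sets = [{0}, {0}, {2}, {1}]            # simple predictive credible sets
thr_sets = [{0, 1}, {0, 1}, {2}, {1, 2}]    # threshold conformal sets

cov_cred = average_coverage(cred_sets, labels)
cov_thr = average_coverage(thr_sets, labels)
coverage_difference = cov_thr - cov_cred    # positive: conformal covers more

print(cov_cred, cov_thr, coverage_difference)
print(average_set_size(cred_sets), average_set_size(thr_sets))
```

A positive coverage difference means the conformal method covers the true label more often than credible sets do; comparing it with the change in average set size shows what that extra coverage costs.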