Out-of-Distribution Detection Should Use Conformal Prediction (and Vice-versa?)

Paul Novello, Joseba Dalmau, Léo Andeol

TL;DR

We define conformal AUROC and conformal FPR@TPR95 metrics: corrections that provide probabilistic conservativeness guarantees on the variability of these metrics.

Abstract

Research on Out-Of-Distribution (OOD) detection focuses mainly on building scores that efficiently distinguish OOD data from In-Distribution (ID) data. Conformal Prediction (CP), on the other hand, uses non-conformity scores to construct prediction sets with probabilistic coverage guarantees. In this work, we propose to use CP to better assess the efficiency of OOD scores. Specifically, we emphasize that in standard OOD benchmark settings, evaluation metrics can be overly optimistic due to the finite sample size of the test dataset. Building on the work of Bates et al. (2022), we define new conformal AUROC and conformal FPR@TPR95 metrics, which are corrections that provide probabilistic conservativeness guarantees on the variability of these metrics. We show the effect of these corrections on two reference OOD and anomaly detection benchmarks, OpenOOD (Yang et al., 2022) and ADBench (Han et al., 2022). We also show that the benefits of combining OOD with CP apply the other way around: using OOD scores as non-conformity scores improves upon current CP methods. One of the key messages of these contributions is that since OOD is concerned with designing scores and CP with interpreting these scores, the two fields may be inherently intertwined.
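The finite-sample effect mentioned in the abstract can be illustrated with a minimal sketch. It is not the correction proposed in the paper; it only uses standard marginal conformal p-values on synthetic Gaussian scores. The function name `marginal_conformal_pvalues`, the sample sizes, and the score distribution are all assumptions made for illustration. The sketch shows that the rate at which fresh ID points are flagged at level $\alpha$ depends on which calibration set was drawn, so it can exceed $\alpha$ for a given calibration set even though it is controlled on average.

```python
import numpy as np


def marginal_conformal_pvalues(cal_scores, test_scores):
    """Marginal conformal p-values, assuming a higher score means "more OOD".

    p(x) = (1 + #{i : s_i >= s(x)}) / (n + 1), with s_i the ID calibration scores.
    """
    cal_sorted = np.sort(np.asarray(cal_scores, dtype=float))
    n = cal_sorted.size
    # Index of the first calibration score >= each test score,
    # so n - index counts the calibration scores >= the test score.
    first_geq = np.searchsorted(cal_sorted, test_scores, side="left")
    return (1.0 + (n - first_geq)) / (n + 1.0)


# Synthetic setting (assumption): ID scores are standard normal.
rng = np.random.default_rng(0)
alpha = 0.1
n_cal, n_test, n_trials = 200, 20_000, 300

fresh_id_scores = rng.normal(size=n_test)  # large ID sample to approximate the true rate
false_alarm_rates = np.empty(n_trials)
for t in range(n_trials):
    cal_scores = rng.normal(size=n_cal)  # a new, small calibration set
    p_values = marginal_conformal_pvalues(cal_scores, fresh_id_scores)
    # Fraction of ID points flagged as OOD at level alpha, for this calibration set.
    false_alarm_rates[t] = np.mean(p_values <= alpha)

print(f"mean false-alarm rate over calibration sets: {false_alarm_rates.mean():.3f}")
print(f"std across calibration sets: {false_alarm_rates.std():.3f}")
print(f"share of calibration sets exceeding alpha: {np.mean(false_alarm_rates > alpha):.2f}")
```

The spread across calibration sets is what a calibration-conditional correction, such as the one the paper builds from Bates et al. (2022), is meant to control with high probability.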

Paper Structure (34 sections, 21 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: Histogram of $F(0.1;\mathcal{D}_{id}^{cal})$ for different calibration sets. The histogram is obtained by splitting the dataset svhn_extra into disjoint calibration sets of $10000$ points each, and approximating the value of $F$ for each calibration set by integrating over the remaining 521131 examples.
  • Figure 2: Cumulative histogram of the marginal p-values (blue) and the calibration-conditional p-values (brown) obtained by performing the Simes adjustment method. Four zooms of the same plot are shown, obtained with a calibration dataset of 10000 points from the SVHN dataset. The approximation of the marginal p-values becomes poor for smaller values of $\alpha$, and it can be overly optimistic. The correction is conservative for all values of $\alpha$ simultaneously, as shown in the figure, which happens with probability $\delta=0.1$ over the choice of the calibration dataset.
  • Figure 3: Different zoom levels of the ROC curves. For all three curves, the TPR is calculated using all the points in the "Cifar10" dataset. As for the FPR, the blue curve is obtained by using all data points in the "svhn_extra" dataset, the orange curve is an approximation of the blue curve using 1000 calibration points, and the green curve is obtained by correcting the FPR via the conformal AUROC method.
  • Figure 4: Visualization of results for the ADBench benchmark. (left) Scatter plot of the mean classical AUROC (y-axis) against the mean AUROC correction (x-axis), averaged over methods for each dataset. (right) Mean AUROC and AUROC correction, averaged over datasets for each AD method.