Performance of Conformal Prediction in Capturing Aleatoric Uncertainty

Misgina Tsighe Hagos, Claes Lundström

TL;DR

This work empirically evaluates whether conformal prediction (CP) effectively captures aleatoric uncertainty arising from class overlap. It compares three CP approaches (LAC, APS, and RAPS) applied to eight models across four multi-annotator datasets (CIFAR-10H, MLRSNet, FER+, ImageNet-ReaL). It measures the correlation between prediction-set size and human annotator ambiguity, the similarity between prediction sets and annotator labels, and calibration via ECE. The key finding is that CP outputs typically exhibit only weak correlations with human-perceived ambiguity, even though set sizes strongly correlate with softmax entropy; larger sets improve alignment with class overlap only in some cases and do not guarantee better coverage. The results underscore that prediction-set size should not be treated as a direct proxy for aleatoric uncertainty, and they motivate developing CP variants that more closely reflect intrinsic labeling ambiguity.

Abstract

Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature suggests that prediction set size can upper-bound aleatoric uncertainty, or that prediction sets are larger for difficult instances and smaller for easy ones, but this attribute of conformal predictors has not been validated. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We do so by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide high coverage of the true classes, their capability to capture aleatoric uncertainty and to generate sets that align with human annotations remains limited.
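To make the evaluation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used LAC formulation (nonconformity score of one minus the softmax probability of the true class, thresholded at a finite-sample corrected quantile) and computes Spearman's rank correlation between prediction set sizes and the number of distinct annotator labels. All function names (`lac_threshold`, `lac_set`, `distinct_label_count`, `spearman`) and the default `alpha` are hypothetical choices for illustration.

```python
import math


def lac_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile of LAC scores s = 1 - p(true class) on calibration data."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    # If the rank exceeds n, no finite threshold gives the guarantee,
    # so every class is included (threshold = infinity).
    return scores[k - 1] if k <= n else math.inf


def lac_set(probs, q_hat):
    """Classes whose score 1 - p does not exceed the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]


def distinct_label_count(annotations):
    """Number of distinct classes chosen by the annotators of one instance."""
    return len(set(annotations))


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s as the Pearson correlation of average ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, one would calibrate `q_hat` on held-out data, build a set for each test instance with `lac_set`, and pass the list of set sizes together with the per-instance `distinct_label_count` values to `spearman`. APS and RAPS use different score functions and are not covered by this sketch.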

Paper Structure

This paper contains 29 sections, 11 equations, 5 figures, and 16 tables.

Figures (5)

  • Figure 1: Distribution of distinct human labels in the CIFAR-10H, MLRSNet, FER+, and ImageNet-ReaL datasets. The X-axis shows the number of distinct human labels per image, and the Y-axis shows the number of images.
  • Figure 2: Conformal prediction set outputs for sample images from the four datasets. Columns from left to right: CIFAR-10H (leftmost), MLRSNet, FER+, and ImageNet-ReaL (rightmost).
  • Figure A1: Distribution of prediction set sizes. The y-axis is log-scaled for easier visualization of the imbalanced counts. The count of the largest three prediction set sizes is shown on top of the bars. The three columns, from left to right, represent LAC, APS, and RAPS, respectively.
  • Figure A2: Coverage at different prediction set sizes. The X-axis shows prediction set sizes, and the Y-axis shows the coverage. The three columns, from left to right, represent LAC, APS, and RAPS, respectively.
  • Figure A3: Spearman's rank correlation coefficient $r_s$ ($p < .001$) between prediction set sizes and class overlap as progressively larger prediction sets are included. The X-axis shows the upper limit of the considered prediction set sizes $(|C| \leq x)$, and the Y-axis shows Spearman's rank correlation coefficient. The three columns, from left to right, represent LAC, APS, and RAPS, respectively.
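
The quantity plotted in Figure A2, coverage at each prediction set size, can be computed by grouping instances by the size of their prediction set and taking the fraction whose set contains the true class. The sketch below is illustrative only; the function name `coverage_by_set_size` and the list-based input format are assumptions, not taken from the paper.

```python
from collections import defaultdict


def coverage_by_set_size(pred_sets, true_labels):
    """Map each prediction set size to the fraction of sets of that size
    that contain the true label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, y in zip(pred_sets, true_labels):
        size = len(pred)
        totals[size] += 1
        hits[size] += int(y in pred)
    # Sizes that never occur are simply absent from the result.
    return {size: hits[size] / totals[size] for size in sorted(totals)}
```

Reporting coverage conditional on set size in this way shows whether larger sets actually contain the true class more often, which marginal coverage alone does not reveal.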