Performance of Conformal Prediction in Capturing Aleatoric Uncertainty
Misgina Tsighe Hagos, Claes Lundström
TL;DR
This work empirically evaluates whether conformal prediction (CP) effectively captures aleatoric uncertainty arising from class overlap. It compares three CP approaches (LAC, APS, and RAPS) applied to eight models across four multi-annotator datasets (CIFAR-10H, MLRSNet, FER+, ImageNet-ReaL). It measures the correlation between prediction-set size and human annotator ambiguity, the similarity between prediction sets and annotator labels, and calibration via expected calibration error (ECE). The key finding is that CP outputs typically correlate only weakly with human-perceived ambiguity, even though set sizes correlate strongly with softmax entropy; larger sets improve alignment with annotator-identified class overlap only in some cases and do not guarantee better coverage. The results indicate that prediction-set size should not be treated as a direct proxy for aleatoric uncertainty, and they motivate developing CP variants that more closely reflect intrinsic labeling ambiguity.
Abstract
Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with high probability. Although prediction set size is expected to capture aleatoric uncertainty, evidence of its effectiveness is lacking. The literature suggests that prediction set size can upper-bound aleatoric uncertainty, or that prediction sets are larger for difficult instances and smaller for easy ones, but this property of conformal predictors has not been validated. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We do so by measuring the correlation between prediction set size and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. Each dataset contains annotations from multiple human annotators per instance (ranging from five to fifty), which enables the identification of class overlap. We find that the vast majority of conformal prediction outputs exhibit very weak to weak correlation with human annotations, with only a few reaching moderate correlation. These findings underscore the need to critically reassess the prediction sets generated by conformal predictors. While they can provide higher coverage of the true classes, their ability to capture aleatoric uncertainty and to generate sets that align with human annotations remains limited.
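As a rough illustration of the evaluation described above, the following sketch builds prediction sets with a split-conformal LAC-style rule and correlates their sizes with the number of distinct annotator labels per instance. It is not the authors' code. It assumes the LAC score is one minus the softmax probability of the true class and uses Spearman rank correlation, which the abstract does not name. All function names, probabilities, and annotator labels are hypothetical toy values.

```python
import math

def lac_threshold(cal_probs, cal_labels, alpha):
    """Conformal quantile of LAC scores (assumed: 1 - softmax prob of true class)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    # If the rank exceeds n, fall back to including every class.
    return scores[k - 1] if k <= n else float("inf")

def lac_set(probs, qhat):
    """Classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical calibration split: softmax outputs and true labels.
cal_probs = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]
cal_labels = [0, 0, 1, 1]
qhat = lac_threshold(cal_probs, cal_labels, alpha=0.2)

# Hypothetical test split: softmax outputs and five annotator labels per instance.
test_probs = [[0.95, 0.03, 0.02], [0.5, 0.45, 0.05], [0.36, 0.33, 0.31], [0.8, 0.15, 0.05]]
annotations = [[0, 0, 0, 0, 0], [0, 1, 0, 1, 0], [0, 1, 2, 1, 0], [0, 0, 1, 0, 0]]

sizes = [len(lac_set(p, qhat)) for p in test_probs]
distinct = [len(set(labels)) for labels in annotations]
rho = spearman(sizes, distinct)
print(sizes, distinct, round(rho, 3))
```

A correlation near 1 would mean that larger prediction sets coincide with instances on which annotators disagree more. The paper reports that, on real data, this correlation is mostly very weak to weak.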
