Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling
Dongping Zhang, Angelos Chatzimparmpas, Negar Kamali, Jessica Hullman
TL;DR
This study evaluates conformal prediction sets, specifically RAPS (Regularized Adaptive Prediction Sets), as a distribution-free approach to expressing uncertainty in AI-advised image labeling. Through a large online experiment that varies distribution (in-distribution, ID, vs. out-of-distribution, OOD), task difficulty (easy vs. hard), and prediction-set size, the authors compare RAPS against Top-1 and Top-k displays and a no-prediction baseline. They find that the utility of prediction sets for labeling accuracy depends on task difficulty: for easy images, prediction sets yield accuracy on par with or below Top-1 and Top-k displays, whereas for OOD images they offer some advantage, especially when the set size is small. The results highlight practical trade-offs in uncertainty presentation, suggesting that deployment should take instance difficulty and set size into account. The work also offers guidance on integrating conformal prediction into real-world AI-advised labeling systems.
Abstract
As deep neural networks are increasingly deployed in high-stakes domains, their black-box nature makes uncertainty quantification challenging. We investigate the presentation of conformal prediction sets, a distribution-free class of methods for generating prediction sets with specified coverage, to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. In a pre-registered analysis, we find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or lower than Top-1 and Top-k displays for easy images, prediction sets offer some advantage in assisting humans in labeling out-of-distribution (OOD) images in the setting that we studied, especially when the set size is small. Our results empirically pinpoint practical challenges of conformal prediction sets and offer implications for how to incorporate them into real-world decision-making.
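To make "prediction sets with specified coverage" concrete, the following is a minimal sketch of basic split conformal prediction for classification. It is not RAPS and not the authors' implementation; it uses the simplest nonconformity score (one minus the softmax probability of the true label), and the function names `calibrate_threshold` and `prediction_set`, as well as the toy calibration data, are assumptions made for illustration only.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold targeting marginal coverage of at least 1 - alpha.

    cal_probs: (n, n_classes) softmax outputs on held-out calibration data.
    cal_labels: (n,) integer true labels for the same examples.
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    # Nonconformity score: one minus the probability given to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: every label must be included.
        return np.inf
    return np.sort(scores)[k - 1]

def prediction_set(probs, threshold):
    """Indices of all labels whose score 1 - p is at or below the threshold."""
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero(1.0 - probs <= threshold)

# Toy calibration data (hypothetical values, not from the study).
cal_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.30, 0.30, 0.40],
    [0.60, 0.30, 0.10],
])
cal_labels = np.array([0, 1, 2, 0])

threshold = calibrate_threshold(cal_probs, cal_labels, alpha=0.5)
labels_shown = prediction_set([0.7, 0.2, 0.1], threshold)
```

Under this scheme a confident model produces small sets and an uncertain one produces large sets, which is why set size is itself a signal a human labeler could read. RAPS builds on the same calibration idea with a different, regularized score intended to keep sets small.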
