Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling

Dongping Zhang, Angelos Chatzimparmpas, Negar Kamali, Jessica Hullman

TL;DR

This study evaluates conformal prediction sets, specifically RAPS, as a distribution-free approach to expressing uncertainty in AI-advised image labeling. Through a large online experiment that varies distribution (ID vs OOD), task difficulty (easy vs hard), and prediction-set size, the authors compare RAPS against Top-$1$ and Top-$k$ displays and a no-prediction baseline. They find that the utility of prediction sets depends on task difficulty: for easy images, accuracy with prediction sets is on par with or lower than accuracy with Top-$1$ and Top-$k$ displays, while for OOD images prediction sets offer some advantage, especially when the set is small. The results highlight practical trade-offs in uncertainty presentation and suggest that deployment should account for instance difficulty and prediction-set size. Together, the work provides guidance on integrating conformal prediction into real-world AI-advised labeling systems.

Abstract

As deep neural networks are more commonly deployed in high-stakes domains, their black-box nature makes uncertainty quantification challenging. We investigate the presentation of conformal prediction sets--a distribution-free class of methods for generating prediction sets with specified coverage--to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. In a pre-registered analysis, we find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or lower than Top-1 and Top-k displays for easy images, prediction sets offer some advantage in assisting humans in labeling out-of-distribution (OOD) images in the setting that we studied, especially when the set size is small. Our results empirically pinpoint practical challenges of conformal prediction sets and provide implications for how to incorporate them into real-world decision-making.
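
To make "prediction sets with specified coverage" concrete, the following is a minimal sketch of split conformal prediction for a classifier. It is not the RAPS procedure used in the study: it assumes the simpler score of one minus the softmax probability of the true label, and the function names (calibrate_threshold, prediction_set) and the miscoverage level alpha are illustrative choices, not taken from the paper.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold targeting about (1 - alpha) coverage on exchangeable data."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: fall back to including every label.
        return 1.0
    return float(np.sort(scores)[k - 1])

def prediction_set(probs, threshold):
    """Labels whose score is at or below the threshold, most probable first."""
    keep = np.flatnonzero(1.0 - probs <= threshold)
    return keep[np.argsort(-probs[keep])]
```

With this score, a confident prediction tends to yield a small set and an uncertain one a larger set (or, in this simplified variant, occasionally an empty set). The coverage guarantee holds only when calibration and test data are exchangeable, which is the assumption that OOD images violate.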

Paper Structure (51 sections, 1 equation, 11 figures, 10 tables, 2 algorithms)

Figures (11)

  • Figure 1: Overview diagram of our key experimental manipulations. (1) Five different covariate shifts are imposed through synthetic image corruption to create five replications of the conformal hold-out set, each containing images that are OOD. (2) Images in each conformal hold-out set are categorized by the classifier's prediction confidence for difficulty and the size of the derived prediction set. Ten task images representative of the categories used to define each group are selected. (3) Participants label 16 task images sampled from 80 candidate images: four in-distribution and 12 OOD, balanced by difficulty and set size, presented in randomized order. Example task stimuli are shown in Figure 2. (4) Based on the conditions assigned, participants may complete labeling tasks without predictions (i.e., baseline) or with access to prediction displays that vary in the content provided by uncertainty quantification (i.e., Top-1, Top-10, or RAPS). Screenshots of the interface as seen by participants are presented in Figure 3.
  • Figure 2: We present four example stimuli (from a total of 10) for each combination of in-distribution or OOD, difficulty, and size categories. The rows differentiate between in-distribution and OOD, while the columns vary by difficulty and set size.
  • Figure 3: Screenshots of the interface participants used to complete the study in each condition (baseline, Top-1, Top-10, RAPS).
  • Figure 4: Participants are provided with three search options to find their preferred choice. (A) Dropdown search: As participants type in the response field, a dropdown menu appears, exemplified by the entry "computer". (B) Keyword search: Participants can search their typed keywords in the WordNet hierarchy by clicking the "Search" button, which opens an overlay displaying a hierarchy network with the relevant network components highlighted. We provide an example of the rendered hierarchy network by searching for "computer". (C) Bottom-up search: By clicking on a predicted label, such as "computer mouse", participants can see its path from the root node, with categories on the path that can be clicked to expand for further exploration. Additionally, while exploring the hierarchy network, participants can hover over any leaf node to see label-representative images, as shown in (B) and (C).
  • Figure 5: Distributions of participants' study completion time (minutes) and bonus earned, by condition. A solid vertical line shows the average and dotted vertical lines show the interquartile range.
  • ...and 6 more figures
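
The trial sampling described in step (3) of the Figure 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: it assumes 8 groups (in-distribution or OOD, two difficulty levels, two set-size categories) of 10 candidates each, with one image drawn per in-distribution group and three per OOD group to reach four in-distribution and 12 OOD images. The category names, image identifiers, and function names are hypothetical.

```python
import random
from itertools import product

DISTRIBUTIONS = ("ID", "OOD")
DIFFICULTIES = ("easy", "hard")   # hypothetical category names
SET_SIZES = ("small", "large")    # hypothetical category names

def build_candidates(per_group=10):
    """Placeholder pool: 8 groups x 10 images = 80 candidate image ids."""
    return {
        group: [f"{'-'.join(group)}-{i}" for i in range(per_group)]
        for group in product(DISTRIBUTIONS, DIFFICULTIES, SET_SIZES)
    }

def sample_trials(pool, rng, n_id_per_group=1, n_ood_per_group=3):
    """Draw a balanced set of task images and present them in random order."""
    trials = []
    for (dist, diff, size), images in pool.items():
        k = n_id_per_group if dist == "ID" else n_ood_per_group
        # Sample without replacement within each group.
        trials += [(dist, diff, size, img) for img in rng.sample(images, k)]
    rng.shuffle(trials)  # randomized presentation order
    return trials
```

For example, sample_trials(build_candidates(), random.Random(0)) returns 16 tuples, four of them in-distribution and 12 OOD, with every difficulty and set-size combination represented equally within each distribution type.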