Using AI Uncertainty Quantification to Improve Human Decision-Making

Laura R. Marusich, Jonathan Z. Bakdash, Yan Zhou, Murat Kantarcioglu

TL;DR

This work tests whether well-calibrated, instance-level AI Uncertainty Quantification (UQ) can improve human decision-making beyond what AI predictions alone provide. It introduces a sampling-based UQ method whose calibration against ground truth is verified with a strictly proper scoring rule, and it evaluates the method's effect in two preregistered online experiments across multiple datasets. Experiment 1 shows that AI UQ improves decision accuracy and confidence calibration over AI predictions alone. Experiment 2 finds no robust differences among uncertainty visualizations or representations, which suggests the benefit generalizes across them. Together, the results support the value of high-quality, instance-level UQ for human-AI interaction and point toward more reliable AI-assisted decision-making in real systems.

Abstract

AI Uncertainty Quantification (UQ) has the potential to improve human decision-making beyond AI predictions alone by providing additional probabilistic information to users. The majority of past research on AI and human decision-making has concentrated on model explainability and interpretability, with little focus on understanding the potential impact of UQ on human decision-making. We evaluated the impact on human decision-making of instance-level UQ, calibrated using a strictly proper scoring rule, in two online behavioral experiments. In the first experiment, our results showed that UQ was beneficial for decision-making performance compared to AI predictions alone. In the second experiment, we found that UQ had generalizable benefits for decision-making across a variety of representations of probabilistic information. These results indicate that implementing high-quality, instance-level UQ for AI may improve decision-making with real systems compared to AI predictions alone.
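
To make the scoring-rule idea concrete, here is a minimal sketch of the Brier score, the scoring rule named in the Figure 1 caption, for binary outcomes. The function names and the example probabilities and labels are hypothetical and chosen only for illustration; this is not the authors' implementation or data. The second function illustrates why the rule rewards honest probabilities: the expected score is lowest when the reported probability equals the true one.

```python
import numpy as np


def brier_score(prob_positive, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    p = np.asarray(prob_positive, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - y) ** 2))


def expected_brier(reported_p, true_p):
    """Expected Brier score when the outcome is 1 with probability true_p."""
    return true_p * (reported_p - 1.0) ** 2 + (1.0 - true_p) * reported_p ** 2


# Hypothetical predicted probabilities and true labels for four instances.
probs = [0.9, 0.8, 0.3, 0.6]
labels = [1, 1, 0, 0]
score = brier_score(probs, labels)  # lower is better; 0 is a perfect score

# Scan candidate reports for an event whose true probability is 0.7 (assumed).
grid = np.linspace(0.0, 1.0, 101)
best_report = float(grid[np.argmin(expected_brier(grid, 0.7))])

print(f"Brier score: {score:.3f}")
print(f"Report minimizing expected score: {best_report:.2f}")
```

On the grid above, the minimizing report coincides with the assumed true probability, which is the property that makes such a score suitable for checking calibration.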

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Brier scores of the "cloned" instances for the Census (100), German Credit (100), and Student Performance (40) datasets, sampled for demonstration. The $y$-axis shows the Brier score. Magenta marks samples that the AI model predicted correctly; cyan marks samples that it predicted incorrectly. Horizontal lines mark the mean Brier score and its 0.5, 1, and 1.5 standard deviation levels.
  • Figure 2: Example showing the information appearing in the three AI conditions in Experiment 1 for a trial from the German Credit dataset condition.
  • Figure 3: Participant accuracy in Experiments 1 (left) and 2 (right). Error bars represent 95% confidence intervals.
  • Figure 4: Predicted effects (level 2/overall results of the multilevel model) of confidence ratings, dataset, and AI condition on accuracy in Experiment 1. Steeper, positively sloped lines indicate better confidence calibration. Shaded areas represent 95% confidence intervals for the predicted values.
  • Figure 5: Participant response times (RT) in milliseconds in Experiment 1 (left) and Experiment 2 (right). Error bars represent 95% confidence intervals.
  • ...and 2 more figures