"Are You Really Sure?" Understanding the Effects of Human Self-Confidence Calibration in AI-Assisted Decision Making

Shuai Ma, Xinru Wang, Ying Lei, Chuhan Shi, Ming Yin, Xiaojuan Ma

TL;DR

The paper investigates how calibrating human self-confidence affects rationality and performance in AI-assisted decision-making. It introduces a Confidence-Correctness Matching framework combining human and AI confidence signals to diagnose inappropriate reliance and evaluates three calibration mechanisms (Think the Opposite, Thinking in Bets, and Calibration Status Feedback) across multiple income-prediction studies. Findings show that human self-confidence calibration can improve task performance and reliance appropriateness in many settings, but benefits can hinge on AI confidence alignment; misalignment can produce adverse effects. The work offers design recommendations and a roadmap for integrating human confidence calibration into future AI-assisted interfaces, highlighting both practical benefits and limitations. Overall, it provides a nuanced, multi-study understanding of how calibrated human confidence shapes collaboration with probabilistic AI systems.

Abstract

In AI-assisted decision-making, it is crucial but challenging for humans to achieve appropriate reliance on AI. This paper approaches this problem from a human-centered perspective, "human self-confidence calibration". We begin by proposing an analytical framework to highlight the importance of calibrated human self-confidence. In our first study, we explore the relationship between human self-confidence appropriateness and reliance appropriateness. Then, in our second study, we propose three calibration mechanisms and compare their effects on humans' self-confidence and user experience. Subsequently, our third study investigates the effects of self-confidence calibration on AI-assisted decision-making. Results show that calibrating human self-confidence enhances human-AI team performance and encourages more rational reliance on AI (in some aspects) compared to uncalibrated baselines. Finally, we discuss our main findings and provide implications for designing future AI-assisted decision-making interfaces.
Paper Structure (71 sections, 11 equations, 13 figures, 2 tables)

This paper contains 71 sections, 11 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Reliability diagrams for a binary classification task (Guo et al., 2017), illustrating calibrated confidence (left, the actual accuracy aligns with the stated confidence), over-confidence (middle, the actual accuracy falls below the stated confidence), and under-confidence (right, the actual accuracy is above the stated confidence).
  • Figure 2: A space of different combinations of 1) initial human prediction correctness and confidence, 2) AI suggestion correctness and its confidence, and 3) human final decision correctness, at a task instance level. To save space, we only highlight situations where a human's initial prediction differs from the AI's suggestion and the human's final decision is incorrect. Comparing (a) and (b), (a) may induce more incorrect AI reliance due to Human C-C Mismatched (Low&Correct). Similarly, (c) may lead to more incorrect self-reliance due to Human C-C Mismatched (High&Incorrect).
  • Figure 3: The interface and procedure for making a prediction on a task instance.
  • Figure 4: An analysis of error rates in different human and AI Confidence-Correctness Matching situations. The left shows the four categories considering both human and AI C-C Matching. The right shows the two categories considering only human C-C Matching, regardless of whether the AI is C-C Matched. Error bars indicate standard errors. (*: $p < 0.05$; **: $p < 0.01$; ***: $p < 0.001$)
  • Figure 5: Interfaces of different self-confidence calibration conditions. (A) Think the Opposite. (B) Thinking in Bets. (C) Calibration Status Feedback contains two views, (1) real-time feedback during the decision-making process and (2) post-hoc feedback after a batch of decision tasks.
  • ...and 8 more figures
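
The two notions used in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a prediction counts as Confidence-Correctness (C-C) matched when it is either high-confidence and correct or low-confidence and incorrect, and that calibration is judged by comparing mean stated confidence with actual accuracy. The names `Prediction`, `cc_matched`, and `calibration_status`, the 0.75 confidence threshold, and the 0.05 tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    confidence: float  # stated confidence, assumed to lie in [0, 1]
    correct: bool      # whether the prediction turned out to be correct


def cc_matched(p: Prediction, threshold: float = 0.75) -> bool:
    """Return True for High&Correct or Low&Incorrect predictions.

    The threshold separating "high" from "low" confidence is an
    assumption of this sketch, not a value taken from the paper.
    """
    is_high = p.confidence >= threshold
    return is_high == p.correct


def calibration_status(preds: List[Prediction], tolerance: float = 0.05) -> str:
    """Compare mean stated confidence with actual accuracy (cf. Figure 1)."""
    if not preds:
        raise ValueError("need at least one prediction")
    mean_confidence = sum(p.confidence for p in preds) / len(preds)
    accuracy = sum(p.correct for p in preds) / len(preds)
    gap = mean_confidence - accuracy
    if gap > tolerance:
        return "over-confident"   # accuracy falls below stated confidence
    if gap < -tolerance:
        return "under-confident"  # accuracy is above stated confidence
    return "calibrated"


if __name__ == "__main__":
    history = [
        Prediction(0.9, True),   # High&Correct: matched
        Prediction(0.9, False),  # High&Incorrect: mismatched
        Prediction(0.6, True),   # Low&Correct: mismatched
        Prediction(0.6, False),  # Low&Incorrect: matched
    ]
    print([cc_matched(p) for p in history])
    print(calibration_status(history))
```

In this sketch, the two mismatched cases correspond to the situations highlighted in Figure 2: Low&Correct, which may induce incorrect reliance on the AI, and High&Incorrect, which may induce incorrect self-reliance.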