Higher Stakes, Healthier Trust? An Application-Grounded Approach to Assessing Healthy Trust in High-Stakes Human-AI Collaboration

David S. Johnson

TL;DR

The paper tackles the challenge of evaluating healthy trust in high-stakes human-AI collaboration by introducing Blockies, a parametric, bias-controlled dataset generator, and an application-grounded evaluation framework for scalable online studies. It combines simulated diagnostic tasks with a real black-box model and storytelling-based stake manipulation to study how perceived stakes affect decision time and trust. An empirical study shows that high-stakes conditions increase cognitive effort but can lead to greater reliance on incorrect AI recommendations, reducing healthy distrust in some contexts. Overall, the framework enables reproducible, large-scale comparative evaluations of explainability and human-AI collaboration methods in realistic decision-making scenarios.

Abstract

Human-AI collaboration is increasingly promoted to improve high-stakes decision-making, yet its benefits have not been fully realized. Application-grounded evaluations are needed to better evaluate methods for improving collaboration but often require domain experts, making studies costly and limiting their generalizability. Current evaluation methods are constrained by limited public datasets and reliance on proxy tasks. To address these challenges, we propose an application-grounded framework for large-scale, online evaluations of vision-based decision-making tasks. The framework introduces Blockies, a parametric approach for generating datasets of simulated diagnostic tasks, offering control over the traits and biases in the data used to train real-world models. These tasks are designed to be easy to learn but difficult to master, enabling participation by non-experts. The framework also incorporates storytelling and monetary incentives to manipulate perceived task stakes. An initial empirical study demonstrated that the high-stakes condition significantly reduced healthy distrust of AI, despite longer decision-making times. These findings underscore the importance of perceived stakes in fostering healthy distrust and demonstrate the framework's potential for scalable evaluation of high-stakes Human-AI collaboration.

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The Blockies framework for evaluating high-stakes Human-AI collaboration. The framework is designed around diagnosing OCDegen in virtual creatures called Blockies. The top right shows examples of healthy Blockies, and examples of Blockies with OCDegen are on the bottom right. The diagnostic task (bottom left) allows researchers to vary the information provided during Human-AI collaboration based on the goals of the study. The user interface used in our initial study with this framework is on the top left.
  • Figure 2: Human-AI collaboration performance for Accuracy, Agreement, Healthy Distrust, and Healthy Trust. Baseline shows how participants performed without AI support, while AI Supported shows performance with AI-recommended diagnoses. Statistically significant changes are marked.
  • Figure 3: Decision time is the average time participants took to make a decision. Statistically significant changes in decision time are marked.
  • Figure 4: Distribution of the four main traits used to diagnose OCDegen. Bone Shape represents the shape of the main bones, typically ranging between 0 and 1. Values above 1 are a biomarker of OCDegen, resulting in a more pointed shape when rendered. Sphere Diff measures the variation between main and secondary bone shapes, with stronger variation indicating OCDegen. Spine Bending represents the degree of bend in a Blocky's posture, where stronger bending is a biomarker of OCDegen. Leg Position ranges from 0 to 1, with values below 0.5 indicating legs pulled back and head peeking out, while values above 0.5 indicate extended legs. Extended legs are a biomarker of OCDegen.
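
The two numeric cutoffs stated in the Figure 4 caption can be written out as a small Python sketch. This is only an illustration of the caption, not the authors' generator or diagnostic rule: the names `BlockyTraits` and `threshold_biomarkers` are hypothetical, and Sphere Diff and Spine Bending are stored but not flagged because the caption gives no cutoff for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockyTraits:
    """Hypothetical container for the four traits named in Figure 4."""
    bone_shape: float     # typically 0..1; values above 1 render as pointed bones
    sphere_diff: float    # variation between main and secondary bone shapes
    spine_bending: float  # degree of bend in the Blocky's posture
    leg_position: float   # 0..1; above 0.5 means extended legs


def threshold_biomarkers(traits: BlockyTraits) -> dict[str, bool]:
    """Flag only the biomarkers for which the caption states a cutoff.

    Assumption: a strict "greater than" comparison, following the caption's
    wording ("above 1", "above 0.5").
    """
    return {
        "pointed_bones": traits.bone_shape > 1.0,
        "extended_legs": traits.leg_position > 0.5,
    }


example = BlockyTraits(bone_shape=1.2, sphere_diff=0.1,
                       spine_bending=0.3, leg_position=0.4)
print(threshold_biomarkers(example))
```

How these biomarkers combine into an OCDegen diagnosis is not specified in this section, so the sketch deliberately stops at per-trait flags.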