Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis

Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan

TL;DR

This work investigates whether domain-expert radiologists can rely appropriately on AI for prostate cancer diagnosis from MRI by conducting two pre-registered experiments with distinct AI-exposure workflows. Using an nnU-Net-based predictor and a custom web interface, the study compares human, AI, and human+AI performance across per-patient and per-lesion metrics, including an upfront AI input condition and a performance-feedback condition. The results show that AI alone consistently outperforms humans and that human+AI teams often underperform AI due to under-reliance, although ensembles of human+AI decisions can achieve complementary performance and sometimes surpass AI. Performance feedback and upfront AI input modulate AI adoption but do not fully close the gap, underscoring the need to refine human-AI collaboration to maximize clinical impact.

Abstract

Despite the growing interest in human-AI decision making, experimental studies with domain experts remain rare, largely due to the complexity of working with domain experts and the challenges in setting up realistic experiments. In this work, we conduct an in-depth collaboration with radiologists in prostate cancer diagnosis based on MRI images. Building on existing tools for teaching prostate cancer diagnosis, we develop an interface and conduct two experiments to study how AI assistance and performance feedback shape the decision making of domain experts. In Study 1, clinicians were asked to provide an initial diagnosis (human), then view the AI's prediction, and subsequently finalize their decision (human-AI team). In Study 2 (after a memory wash-out period), the same participants first received aggregated performance statistics from Study 1, specifically their own performance, the AI's performance, and their human-AI team performance, and then directly viewed the AI's prediction before making their diagnosis (i.e., no independent initial diagnosis). These two workflows represent realistic ways that clinical AI tools might be used in practice; the second study simulates a scenario in which doctors can adjust their reliance on and trust in AI based on prior performance feedback. Our findings show that, while human-AI teams consistently outperform humans alone, they still underperform the AI due to under-reliance, similar to prior studies with crowdworkers. Providing clinicians with performance feedback did not significantly improve the performance of human-AI teams, although showing AI decisions in advance nudges people to follow AI more. Meanwhile, we observe that the ensemble of human-AI teams can outperform AI alone, suggesting promising directions for human-AI collaboration.
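To make the ensemble idea in the abstract concrete, the following is a minimal sketch, assuming that an ensemble is formed by a per-case majority vote over the binary decisions of several readers, with ties counted as negative. The text above does not state the aggregation rule, so both choices are assumptions. The function names (`majority_vote`, `ensemble`, `binary_metrics`) and the toy labels are hypothetical and are not data from the study.

```python
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 if strictly more than half of the binary votes are 1.

    Assumption: ties are resolved as negative (0).
    """
    return int(2 * sum(votes) > len(votes))


def ensemble(decisions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-reader decisions; decisions[r][c] is reader r's call for case c."""
    return [majority_vote(case_votes) for case_votes in zip(*decisions)]


def _ratio(num: int, den: int) -> float:
    """Safe division that yields NaN when the denominator is zero."""
    return num / den if den else float("nan")


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Per-patient accuracy, sensitivity, specificity, PPV, and NPV."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Toy labels, invented purely for illustration (1 = cancer, 0 = no cancer).
y_true = [1, 0, 1, 0, 1, 0]
readers = [
    [1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
]

team = ensemble(readers)
print("ensemble:", binary_metrics(y_true, team))
for i, reader in enumerate(readers, start=1):
    print(f"reader {i}:", binary_metrics(y_true, reader))
```

In this toy example each reader makes two errors, but the errors fall on different cases, so the majority vote recovers every label. This is the kind of complementarity that an ensemble can exploit; whether it holds in practice depends on how correlated the readers' errors are.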

Paper Structure

This paper contains 18 sections, 12 figures, 13 tables.

Figures (12)

  • Figure 1: Overview of our experiments with radiologists. In Study 1, participating radiologists (N=8) reviewed 75 cases in three steps: initial independent diagnosis, review of AI predictions, and final diagnosis. In Study 2, each radiologist first received performance feedback summarizing their individual performance in Study 1. They then reviewed 100 cases with direct AI assistance and no independent initial diagnosis.
  • Figure 2: Screenshots of the webapp interface for our human study. (a) The user interface for patient case evaluation. An AI lesion prediction is highlighted with a red contour in the T2W sequence. On the right, the user's current prediction is shown as "No Cancer," and they are at the stage of evaluating the AI prediction to make a final diagnosis. (b) The user interface of the Annotation Panel. The screenshot shows a current annotation of the user. The user can clear the annotation or add new annotations on the canvas. (c) An example performance feedback page presented to a user before proceeding to Study 2. The page provides a summary of the total number of cases, including counts of correct and incorrect cases, the number of decision changes influenced by AI advice, and whether those changes were correct or incorrect. It also highlights key performance metrics such as accuracy, sensitivity, and specificity, derived from Study 1. To ensure users review the information carefully, they are required to answer attention check questions.
  • Figure 3: An example of lesion-level annotation comparing human experts (red contour), AI (yellow contour), and expert annotation from the dataset (green contour). In this case, the AI successfully detected a lesion which corresponded to a clinically significant prostate cancer in the dataset; our human radiologist did not identify this lesion, and instead annotated a lesion in the transition zone.
  • Figure 4: Individual radiologists' performance compared with the AI model. The model achieves higher performance than all of the radiologists without AI assistance (blue dots). However, with AI assistance, some individual radiologists outperformed the AI model (red and orange dots that are above the curve).
  • Figure 5: Mean performance of Human-alone, Human+AI, Human-ensemble, Human+AI-ensemble, and AI in Study 1. AUROC, accuracy, specificity, NPV, and PPV are significantly better in Human-ensemble than in Human-alone. In Human+AI-ensemble, AUROC, accuracy, and PPV are significantly better than those of AI.
  • ...and 7 more figures