The Framework That Survives Bad Models: Human-AI Collaboration For Clinical Trials

Yao Chen, David Ohlssen, Aimee Readie, Gregory Ligozio, Ruvie Martin, Thibaud Coroller

TL;DR

Clinical trial endpoints derived from imaging are vulnerable to biased AI assessments. The authors compare two AI-enabled workflows, AI as an independent reader (AI-IR) and AI as a supporting reader (AI-SR), against human double reading (HDR) using two randomized controlled trials with endpoints derived from spinal X-ray images. They stress-test each workflow with degraded models (random and naive) to check that observed treatment effects remain valid. The results show that AI-SR offers the best combination of accuracy, robustness, and cost-efficiency: it preserves treatment effect estimates and agrees with HDR conclusions, even in the PREVENT generalization case. The work provides a practical framework for safely integrating AI into trial workflows across populations, with implications for faster, cheaper, and more robust mSASSS-based endpoint evaluation.

Abstract

Artificial intelligence (AI) holds great promise for supporting clinical trials, from patient recruitment and endpoint assessment to treatment response prediction. However, deploying AI without safeguards poses significant risks, particularly when evaluating patient endpoints that directly impact trial conclusions. We compared two AI frameworks against human-only assessment for medical image-based disease evaluation, measuring cost, accuracy, robustness, and generalization ability. To stress-test these frameworks, we injected bad models, ranging from random guesses to naive predictions, to ensure that observed treatment effects remain valid even under severe model degradation. We evaluated the frameworks using two randomized controlled trials with endpoints derived from spinal X-ray images. Our findings indicate that using AI as a supporting reader (AI-SR) is the most suitable approach for clinical trials, as it meets all criteria across various model types, even with bad models. This method consistently provides reliable disease estimation, preserves clinical trial treatment effect estimates and conclusions, and retains these advantages when applied to different populations.
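The two kinds of degraded model named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "naive" model always predicts the majority class and that the "random" model guesses uniformly over the four mSASSS corner scores; the abstract does not define either model. The class frequencies are those quoted in the Figure 1 caption, while the simulated ground truth and all function names are hypothetical.

```python
import random

# Per-corner mSASSS class frequencies quoted in the Figure 1 caption.
CLASS_FREQ = {0: 0.81, 1: 0.03, 2: 0.05, 3: 0.11}


def naive_model(n_corners):
    """Assumed 'naive' model: always predict the majority class (score 0)."""
    majority = max(CLASS_FREQ, key=CLASS_FREQ.get)
    return [majority] * n_corners


def random_model(n_corners, rng):
    """Assumed 'random' model: uniform guess over the four possible scores."""
    scores = list(CLASS_FREQ)
    return [rng.choice(scores) for _ in range(n_corners)]


def simulate_truth(n_corners, rng):
    """Hypothetical ground truth sampled from the Figure 1 frequencies."""
    scores = list(CLASS_FREQ)
    weights = list(CLASS_FREQ.values())
    return rng.choices(scores, weights=weights, k=n_corners)


def accuracy(pred, truth):
    """Fraction of vertebral corners where prediction equals ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
truth = simulate_truth(10_000, rng)
naive_pred = naive_model(len(truth))
random_pred = random_model(len(truth), rng)

naive_acc = accuracy(naive_pred, truth)
random_acc = accuracy(random_pred, truth)
print(f"naive accuracy:  {naive_acc:.2f}")
print(f"random accuracy: {random_acc:.2f}")
```

Under these assumptions the naive model agrees with roughly four out of five corners while never flagging any structural damage, because score 0 dominates the data. This is one way to see why per-corner agreement alone is a weak safeguard, and why the abstract frames the stress test around treatment effect estimates instead.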

Paper Structure

This paper contains 11 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: mSASSS definition and distribution in the dataset. mSASSS scoring criteria (0–3) based on structural changes at vertebral corners, alongside their frequency in the dataset. Most corners are normal (score 0, 81%), while erosion/sclerosis/squaring (score 1), syndesmophytes (score 2), and bridging (score 3) are less common (3%, 5%, and 11%, respectively).
  • Figure 2: Framework implementation for the experiment. The gold standard, human double reader (HDR), is compared with two AI approaches, AI as an independent reader (AI-IR) or as a supporting reader (AI-SR), for evaluating spinal disease status (modified Stoke Ankylosing Spondylitis Spinal Score, mSASSS) from X-ray images. At the end of each framework, the scores from the readers are pooled to obtain the consensus mSASSS. One key difference between the two AI frameworks is that AI-IR includes the AI score in the consensus mSASSS, while AI-SR does not.
  • Figure 3: End-to-end pipeline for vertebral unit (VU) extraction and mSASSS classification.
  • Figure 4: Cost difference between HDR, AI-IR and AI-SR when using extreme cases in prediction (naive and random model vs trained).
  • Figure 5: Impact of relative cost of arbitration w.r.t. the first human reader.
  • ...and 5 more figures