Establishing Rigorous and Cost-effective Clinical Trials for Artificial Intelligence Models

Wanling Gao, Yunyou Huang, Dandan Cui, Zhuoming Yu, Wenjing Liu, Xiaoshuang Liang, Jiahui Zhao, Jiyue Xie, Hao Li, Li Ma, Ning Ye, Yumiao Kang, Dingfeng Luo, Peng Pan, Wei Huang, Zhongmou Liu, Jizhong Hu, Gangyuan Zhao, Chongrong Jiang, Fan Huang, Tianyi Wei, Suqin Tang, Bingjie Xia, Zhifei Zhang, Jianfeng Zhan

TL;DR

The paper addresses the gap between AI development and clinical practice by highlighting the inadequacy of traditional evaluations and proposing DC-AI RCTs and VC-MedAI as rigorous, cost-effective alternatives. The authors run a two-step DC-AI RCT across 14 centers with 125 clinicians and 7500 diagnosis records, and develop VC-MedAI as a preclinical-like in-silico trial framework that mirrors prospective trials. The DC-AI RCTs reveal substantial interactions between clinicians and AI: invisible random models improve AUC by 3.37 percentage points, and AI models yield AUC gains ranging from 1.95 to 10.9 percentage points. The VC-MedAI specialized simulator achieves an AUC of around 0.81–0.82, the generalized simulator reaches about 0.85, and VC-MedAI provides a roughly 150-fold speedup in evaluating new AI tools. The study argues that these methods can accelerate safe, iterative integration of AI into practice and guide future AI development with attention to clinician collaboration and demographic representativeness.

Abstract

A profound gap persists between artificial intelligence (AI) and clinical practice in medicine, primarily due to the lack of rigorous and cost-effective evaluation methodologies. State-of-the-art and state-of-the-practice AI model evaluations are limited to laboratory studies on medical datasets or direct clinical trials with no or solely patient-centered controls. Moreover, the crucial role of clinicians in collaborating with AI, pivotal for determining its impact on clinical practice, is often overlooked. For the first time, we emphasize the critical necessity for rigorous and cost-effective evaluation methodologies for AI models in clinical practice, featuring patient/clinician-centered (dual-centered) AI randomized controlled trials (DC-AI RCTs) and virtual clinician-based in-silico trials (VC-MedAI) as an effective proxy for DC-AI RCTs. Leveraging 7500 diagnosis records from two-step inaugural DC-AI RCTs across 14 medical centers with 125 clinicians, our results demonstrate the necessity of DC-AI RCTs and the effectiveness of VC-MedAI. Notably, VC-MedAI performs comparably to human clinicians, replicating insights and conclusions from prospective DC-AI RCTs. We envision DC-AI RCTs and VC-MedAI as pivotal advancements, presenting innovative and transformative evaluation methodologies for AI models in clinical practice, offering a preclinical-like setting mirroring conventional medicine, and reshaping development paradigms in a cost-effective and fast-iterative manner. Chinese Clinical Trial Registration: ChiCTR2400086816.

Paper Structure (5 sections, 5 figures, 1 table)

This paper contains 5 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Transforming Clinical Trials for AI Models in Clinical Practice. (a) and (b): patient-centered trials for conventional medicine evaluation and AI model evaluation in clinical practice. (c): rigorous and cost-effective trials for AI model evaluation in clinical practice: laboratory studies, preclinical-like trials using in-silico VC-MedAI, and dual-centered AI randomized controlled trials (DC-AI RCTs).
  • Figure 2: Workflow of VC-MedAI In-silico Trials. VC-MedAI is constructed and modeled from the Step #1 DC-AI RCTs and contains a virtual clinician generator and a clinician behavior simulator. We evaluate the effectiveness of VC-MedAI prospectively by comparing it with the Step #2 DC-AI RCTs.
  • Figure 3: VC-MedAI Behaves Similarly to Human Clinicians in the Prospective Step #2 DC-AI RCTs. (a) and (b): comparisons of averaged preliminary and final (two-stage) diagnosis accuracy between the VC-MedAI specialized simulator and clinicians. (c) and (d): two-stage accuracy comparisons between the VC-MedAI generalized simulator and clinicians. (e) and (f): two-stage time comparisons between the VC-MedAI specialized simulator and clinicians. (g) and (h): two-stage time comparisons between the VC-MedAI generalized simulator and clinicians. The breakdown comparisons of O (Overall), AI (AI models), S (Sex), A (Age), I (Institution), W (Years of Working), P (Class of Position), D (Department), and PT (Patient Type) are arranged from inside to outside within each corresponding sector, as shown in the legend. The generalized simulator has no PT sector because it performs general simulation. (i) and (j): specific diagnosis examples of ten clinicians and the corresponding VC-MedAI virtual clinicians with the same features. (i): diagnosis accuracy. (j): diagnosis behavior (click sequence of examination items). Each sequence from Seq1 to Seq20 contains the operations of a human clinician (left) and the corresponding virtual one (right). Note that history fundamental examination items include body temperature, systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, consciousness level, and qSOFA (quick Sequential Organ Failure Assessment).
  • Figure 4: VC-MedAI supports the discovery of consistent outcomes with real-world clinical trials. (a) and (b): the averaged preliminary and final diagnosis (two-stage) accuracy comparisons between human clinical and in-silico trials (specialized). (c) and (d): two-stage diagnosis accuracy comparisons between human clinical and in-silico trials (generalized). (e) and (f): two-stage diagnosis time comparisons between human clinical and in-silico trials (specialized). (g) and (h): two-stage diagnosis time comparisons between human clinical and in-silico trials (generalized). Note that the breakdown comparisons of O, AI, S, A, I, W, P, D, and PT are arranged from inside to outside within each corresponding sector. (i) and (j) represent the time reduction for early detection relative to the onset time of sepsis patients, which is a primary clinical phenomenon related to clinical outcome. (i): preliminary and (j): final diagnosis. YS, EW, and NS indicate the model's prediction as sepsis, early warning of sepsis, and non-sepsis, respectively. A larger value signifies larger time reduction and earlier detection of sepsis.
  • Figure 5: The Design and Implementation of VC-MedAI. Based on the Step #1 DC-AI RCTs, VC-MedAI identifies a series of features covering clinician, AI model, and patient properties. The VC-MedAI generator produces a user-defined number of virtual clinicians whose features reflect those of the real-world human clinician population. The VC-MedAI simulator receives the feature input, simulates the operation behaviors, diagnosis decision, and time consumption during the preliminary diagnosis stage, and outputs the final diagnosis and time consumption during the final diagnosis stage.
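
The generator idea in the Figure 5 caption (a user-defined number of virtual clinicians whose features resemble the real population) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the simplest possible approach, resampling observed feature profiles with replacement. The function names `generate_virtual_clinicians` and `marginal`, the feature keys, and all records in `REAL_CLINICIANS` are hypothetical and made up for illustration.

```python
import random
from collections import Counter

# Hypothetical clinician records. The feature names loosely follow the
# caption's breakdown (sex, years of working, class of position); the
# values are invented for illustration and are not from the paper.
REAL_CLINICIANS = [
    {"sex": "F", "years_working": "<5", "position": "junior"},
    {"sex": "M", "years_working": "5-10", "position": "intermediate"},
    {"sex": "F", "years_working": ">10", "position": "senior"},
    {"sex": "M", "years_working": "<5", "position": "junior"},
]


def generate_virtual_clinicians(population, n, seed=0):
    """Draw n virtual clinicians by resampling real profiles with replacement.

    Assumption: resampling whole profiles keeps the joint feature
    distribution of the observed population, which is one simple way to
    make the virtual cohort resemble the real one.
    """
    rng = random.Random(seed)  # seeded so that a cohort is reproducible
    return [dict(rng.choice(population)) for _ in range(n)]


def marginal(clinicians, feature):
    """Return the share of each value of one feature in a cohort."""
    counts = Counter(c[feature] for c in clinicians)
    total = len(clinicians)
    return {value: count / total for value, count in counts.items()}
```

Comparing `marginal(REAL_CLINICIANS, "sex")` with the same call on a large generated cohort gives a quick check that the virtual population tracks the real one. A behavior simulator, as described in the caption, would then take these feature profiles as input; that part is not sketched here because the source does not specify the model.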