A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic

Peter Brodeur; Jacob M. Koshy; Anil Palepu; Khaled Saab; Ava Homiar; Roma Ruparel; Charles Wu; Ryutaro Tanno; Joseph Xu; Amy Wang; David Stutz; Hannah M. Ferrera; David Barrett; Lindsey Crowley; Jihyeon Lee; Spencer E. Rittner; Ellery Wulczyn; Selena K. Zhang; Elahe Vedadi; Christine G. Kohn; Kavita Kulkarni; Vinay Kadiyala; Sara Mahdavi; Wendy Du; Jessica Williams; David Feinbloom; Renee Wong; Tao Tu; Petar Sirkovic; Alessio Orlandi; Christopher Semturs; Yun Liu; Juraj Gottweis; Dale R. Webster; Joëlle Barral; Katherine Chou; Pushmeet Kohli; Avinatan Hassidim; Yossi Matias; James Manyika; Rob Fields; Jonathan X. Li; Marc L. Cohen; Vivek Natarajan; Mike Schaekermann; Alan Karthikesalingam; Adam Rodman

TL;DR

This study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.

Abstract

Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presentation of potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. 100 adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess the conversational safety and quality, patient and clinician experience, and clinical reasoning capabilities compared to primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultations based on pre-defined criteria. Patients reported high satisfaction and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful with a positive impact on preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, without significant differences for DDx (p = 0.6) and appropriateness and safety of Mx (p = 0.1 and 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.

Paper Structure (49 sections, 15 figures, 17 tables)

Introduction
AI System
Methods
Study Setting and Oversight
Patient Eligibility
Patient and PCP Recruitment
Intervention Protocol
Safety Pilot
Semi-Structured Interviews
Data Security
Infrastructure Challenges
Outcome Measures
Primary Outcomes
Secondary Outcomes
Exploratory outcomes
...and 34 more sections

Figures (15)

Figure 1: Study design. For each patient, the study flow consisted of three distinct steps: (1) The patient conversed with AMIE through a synchronous text chat interface, while an AI supervisor observed the patient-AMIE chat on a video-call with screen-sharing to ensure patient safety; (2) the patient saw a primary care physician (PCP) in person or via telehealth up to five days after the AMIE chat, with the PCP having access to the AMIE chat transcript and summary; (3) eight weeks after the provider visit, the final diagnosis was extracted from the patient's chart and three independent clinical evaluators rated the chat conversation quality as well as the differential diagnosis and management plan from both AMIE and PCP in a blinded and randomized manner.
Figure 2: Clinical Reasoning Performance. Clinical evaluators rated the quality of management plans and differential diagnoses of both AMIE and PCPs in a blinded randomized manner. For each patient case, ratings were aggregated as the median rating across a panel of three independent clinical evaluators. (A) Comparative ratings from clinical evaluators assessed the quality of two candidate management plans and differential diagnoses in a side-by-side manner relative to each other; error bars for comparative ratings represent 95% confidence intervals for binomial proportions (N=98) for expressing a preference ('slightly better' or 'much better') for either of the two candidates. (B) Pointwise ratings from clinical evaluators assessed the quality of each management plan and differential diagnosis individually on a 5-point Likert scale. For pointwise ratings, asterisks represent statistical significance per two-sided Wilcoxon signed-rank tests with Bonferroni correction ($**:p<0.01$, $n.s.:$ not significant). In addition to ratings from clinical evaluators, we measured (C) AMIE's Top-k diagnostic accuracy as compared to the final diagnosis extracted for each patient via chart review eight weeks after their PCP visit. In addition to overall accuracy across all patients (N=98), we provide accuracy for the subset of patients where the final diagnosis was confirmed by a diagnostic test such as imaging, microbiology, laboratory, pathology, EKG (N=46), and the subset where this was not the case, i.e., where the diagnosis was presumptive without a diagnostic test, irrespective of whether the diagnosis was made by a PCP or specialist (N=52). Error bars for diagnostic accuracy correspond to 95% confidence intervals for binomial proportions.
Figure 3: Effect on patient attitudes towards AI. Patients completed the General Attitudes towards AI Scale (GAAIS) prior to interacting with AMIE (Pre-AI), after interacting with AMIE (Post-AI), and after the urgent care consultation with the provider (Post-Provider). The GAAIS scale includes two sub-scales corresponding to (1) perceived utility and (2) concerns around AI. Attitudes became more positive after interacting with AMIE and remained at an elevated level after seeing the PCP. This change was statistically significant for both sub-scales and the overall scale, as confirmed by a Friedman omnibus test followed by pairwise Wilcoxon post-hoc tests (p < 0.001 for omnibus and pairwise tests).
Figure 4: AMIE Conversation Quality. The quality of AMIE conversations was rated from patient and clinician perspectives. Patient perspectives were collected through surveys immediately after patients completed their interaction with AMIE and the safety debrief with the AI supervisor. Clinician perspectives were rated post-hoc by a panel of three independent clinical evaluators per patient case, and aggregated using the median across the three ratings per case.
Figure A.1: AMIE Top-k Diagnostic Accuracy. Percentage of cases where the final diagnosis, per chart review 8 weeks post-encounter, was present in the top-k items of AMIE's differential. Top-left: All patient cases. Top-right: Subgroup analysis based on whether the final diagnosis was the PCP's presumptive diagnosis without any specialist follow-up or diagnostic test involved (N=35). Bottom-left: Subgroup analysis based on whether a specialist follow-up was involved in establishing the final diagnosis (N=31). Bottom-right: Subgroup analysis based on whether the final diagnosis was confirmed by a diagnostic test such as laboratory, microbiological, pathological, or imaging (N=46), regardless of whether a specialist follow-up was involved or not. Shaded error bars correspond to 95% confidence intervals for binomial proportions.
...and 10 more figures
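
The captions for Figure 2 and Figure A.1 mention three computations: taking the median rating across a panel of three evaluators, Top-k diagnostic accuracy against the chart-review diagnosis, and 95% confidence intervals for binomial proportions. The Python sketch below illustrates these on toy data. It is not the study's analysis code: the function names, the toy cases, and the exact-string match between a differential item and the final diagnosis are all assumptions made for illustration, and the Wilson score interval is one common choice of binomial interval that the captions do not name.

```python
from math import sqrt
from statistics import median


def aggregate_ratings(ratings_per_case):
    """Median of each case's panel of ratings (e.g. three 5-point Likert scores)."""
    return [median(panel) for panel in ratings_per_case]


def top_k_accuracy(cases, k):
    """Count cases whose final diagnosis appears in the first k DDx items.

    `cases` is a list of (ranked_ddx, final_diagnosis) pairs. Exact string
    matching is an illustrative simplification (assumption).
    Returns (hits, n).
    """
    hits = sum(final in ddx[:k] for ddx, final in cases)
    return hits, len(cases)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy data, invented for illustration only.
toy_cases = [
    (["migraine", "tension headache", "sinusitis"], "tension headache"),
    (["gerd", "angina"], "costochondritis"),
    (["uti", "pyelonephritis"], "uti"),
    (["viral uri", "strep pharyngitis", "mononucleosis", "allergic rhinitis"],
     "allergic rhinitis"),
]

medians = aggregate_ratings([[4, 5, 3], [2, 2, 5]])   # [4, 2]
hits, n = top_k_accuracy(toy_cases, k=3)              # (2, 4)
low, high = wilson_interval(hits, n)
print(medians, hits / n, (round(low, 3), round(high, 3)))
```

With only a handful of cases the interval is very wide, which is why the subgroup panels in such figures carry visibly larger error bars than the all-cases panel.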
