Conversational Medical AI: Ready for Practice

Antoine Lizée, Pierre-Auguste Beaucoté, James Whitbeck, Marion Doumeingts, Anaël Beaugnon, Isabelle Feldhaus

TL;DR

This study investigates Mo, a physician-supervised LLM-based conversational agent integrated into Alan’s real-world medical chat service, as a response to physician shortages. In a randomized controlled experiment (n = 926 eligible conversations), the authors find that AI-assisted conversations yield higher information clarity and overall satisfaction without compromising trust or perceived empathy, and that general practitioners rated 95% of Mo's conversations as "good" or "excellent". Mo’s development relies on a multi-agent framework and an offline evaluation pipeline with three components: a French medical knowledge benchmark, real-world anonymized chats, and simulated patient dialogues that test end-to-end dialogue capabilities. The findings suggest that AI augmentation can enhance patient experience while preserving safety. They offer practical guidance for deploying AI in healthcare communication and inform future research on long-term outcomes, system integration, and privacy protections.

Abstract

The shortage of doctors is creating a critical squeeze in access to medical expertise. While conversational Artificial Intelligence (AI) holds promise in addressing this problem, its safe deployment in patient-facing roles remains largely unexplored in real-world medical settings. We present the first large-scale evaluation of a physician-supervised LLM-based conversational agent in a real-world medical setting. Our agent, Mo, was integrated into an existing medical advice chat service. Over a three-week period, we conducted a randomized controlled experiment with 926 cases to evaluate patient experience and satisfaction. Among these, Mo handled 298 complete patient interactions, for which we report physician-assessed measures of safety and medical accuracy. Patients reported higher clarity of information (3.73 vs 3.62 out of 4, p < 0.05) and overall satisfaction (4.58 vs 4.42 out of 5, p < 0.05) with AI-assisted conversations compared to standard care, while showing equivalent levels of trust and perceived empathy. The high opt-in rate (81% among respondents) exceeded previous benchmarks for AI acceptance in healthcare. Physician oversight ensured safety, with 95% of conversations rated as "good" or "excellent" by general practitioners experienced in operating a medical advice chat service. Our findings demonstrate that carefully implemented AI medical assistants can enhance patient experience while maintaining safety standards through physician supervision. This work provides empirical evidence for the feasibility of AI deployment in healthcare communication and insights into the requirements for successful integration into existing healthcare services.

Paper Structure

This paper contains 26 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Offline evaluation methods. (a) Multiple-choice medical exam questions assess French medical knowledge and clinical reasoning. (b) Real-world medical advice conversations evaluate response quality and relevance. (c) Simulated conversations with patient agents evaluate end-to-end information gathering and recommendation accuracy.
  • Figure 2: Transparent user interface. (a) When patients initiate a conversation in the medical advice chat, Mo first reformulates their concern and explicitly asks for their preference: they can either start with Mo's assistance or opt to wait for a physician. (b) At the end of Mo interactions, physicians engage directly with the patient to acknowledge their oversight of the conversation, validate Mo's medical guidance, and provide complementary advice when necessary. Here, we also show the entry point for the user ratings survey.
  • Figure 3: Physician review interface for Mo messages. Physicians review each Mo message and select one of the four rating icons within 15 minutes. The right-most choice removes the message from the patient’s view.
  • Figure 4: Flow diagram of Mo deployment in medical advice conversations. Of 1,566 conversations where Mo was active, 640 (41%) were out of scope. Among eligible conversations (n = 926), Mo was proposed to 474 patients, with 452 as controls. After excluding no-responses (n = 53) and declines (n = 81), 340 patients opted to interact with Mo, of whom 298 (88%) completed their conversations. Percentages in parentheses represent rates adjusted for no-responses.
  • Figure 5: Patient ratings: comparison between Mo and control groups. Distribution of patient ratings for Mo and control groups across different dimensions. Top: Overall satisfaction rated on a 5-point scale (1 to 5). Bottom: Specific dimensions (Empathy, Trust, Clarity) rated on a 4-point scale ('not at all' to 'perfectly'). Numbers on the right show mean scores. Asterisks (*) indicate statistically significant differences between groups (p < 0.05). Percentages show proportions of responses in each category.
  • ...and 4 more figures
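
The counts in the Figure 4 caption can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names and the `pct` helper are our own, not the authors', and it assumes that "adjusted for no-responses" means the opt-in rate uses respondents (proposed minus no-responses) as its denominator, which matches the "81% among respondents" wording in the abstract.

```python
# Counts copied from the Figure 4 caption; all names here are hypothetical.
active = 1566        # conversations where Mo was active
out_of_scope = 640   # conversations judged out of scope
proposed = 474       # eligible patients who were offered Mo
controls = 452       # eligible patients in the control group
no_response = 53     # offered Mo, never answered the prompt
declined = 81        # offered Mo, chose to wait for a physician
completed = 298      # opted in and finished the conversation with Mo


def pct(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


eligible = active - out_of_scope               # 926
respondents = proposed - no_response           # 421
opted_in = proposed - no_response - declined   # 340

# The two arms should add up to the eligible pool.
assert eligible == proposed + controls

out_of_scope_rate = pct(out_of_scope, active)  # 41
opt_in_rate = pct(opted_in, respondents)       # 81 (assumed denominator)
completion_rate = pct(completed, opted_in)     # 88

print(eligible, opted_in, out_of_scope_rate, opt_in_rate, completion_rate)
```

Under these assumptions the script reproduces the 41% out-of-scope rate, the 81% opt-in rate, and the 88% completion rate quoted in the caption and abstract.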