Towards Conversational AI for Disease Management

Anil Palepu, Valentin Liévin, Wei-Hung Weng, Khaled Saab, David Stutz, Yong Cheng, Kavita Kulkarni, S. Sara Mahdavi, Joëlle Barral, Dale R. Webster, Katherine Chou, Avinatan Hassidim, Yossi Matias, James Manyika, Ryutaro Tanno, Vivek Natarajan, Adam Rodman, Tao Tu, Alan Karthikesalingam, Mike Schaekermann

TL;DR

The paper addresses management reasoning in disease care by extending AMIE, an LLM-based agentic system, to reason over disease evolution and multiple patient visits. It introduces a dual-agent architecture built on Gemini with in-context retrieval and structured generation: a Dialogue Agent handles conversational interaction, and an Mx Agent handles long-context, guideline-grounded planning. Reasoning is grounded in a large corpus of clinical guidelines (NICE and BMJ Best Practice), and medication reasoning is evaluated with RxQA, a new benchmark derived from OpenFDA and the British National Formulary. In a randomized, blinded virtual OSCE study with 100 scenarios across five specialties, AMIE was non-inferior to primary care physicians in management reasoning, scored better on the preciseness of treatments and investigations and on guideline alignment, and outperformed them on higher-difficulty medication questions. The authors present this as a significant step toward AI-assisted, longitudinal disease management, while noting that further validation is needed before clinical deployment.

Abstract

While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.

Paper Structure

This paper contains 78 sections, 42 figures, 7 tables.

Figures (42)

  • Figure 1: System architecture. AMIE is a symbiosis of two specialized agents: the Dialogue Agent, whose role is to converse with the patient to collect information and communicate clinical decisions, and the Mx Agent, whose role is to browse full-text clinical guidelines and compile tailored management plans, which are delivered to the Dialogue Agent.
  • Figure 2: Reasoning and planning under structural constraints. Inference-time decoding constraints are applied to constrain the model output to a predefined JSON structure and sets of values. The structures are defined in Python code, generated based on the set of retrieved guidelines and automatically converted into decoding constraints. The corresponding JSON schema is appended to the prompt. Panel A displays the target structure represented as Python code. Panel B illustrates a reasoning trace generated using this structure, represented as a tree. Excerpts C & D show parts of the reasoning steps. Excerpt E shows a section of the generated plan. Each plan item is annotated with generated references to the source documents through in-context reasoning over the set of retrieved guidelines.
  • Figure 3: Evaluation: overview of randomized study design. A primary care physician (PCP) and AMIE perform (in a randomized order) three virtual remote OSCE visits with simulated patients via online multi-turn synchronous text chat, building upon an initial patient presentation with subsequent updates on symptoms, treatment responses, and test results. Both the PCP and AMIE have access to a corpus of clinical guidelines. After each visit, the PCP and AMIE complete a post-questionnaire, and both are evaluated by patient actors and specialist physicians across a range of axes, including diagnostic accuracy, guideline entailment, management reasoning, and clinical communication skills.
  • Figure 4: Management plan quality. The quality of management plans for each of three visits per scenario, measured as the proportion of cases where AMIE or PCPs received favorable ratings from specialist physicians. We present overall quality criteria alongside criteria specific to investigations and treatments respectively. For each category, we include two quality criteria that address the use of clinical guidelines (right). Of 15 evaluation axes tested, 9 were based on Yes/No rating scales. The remaining ones were binarized using the top-2 options on their respective scales: 'Overall Appropriate' (5-point scale), 'Selected Applicable Guidelines' (5-point scale), 'Aligned with Guidelines' (5-point scale), 'References Guidelines' (4-point scale). For each evaluation axis, cases with 'N/A' ratings on either study arm were excluded for each visit. The sample size is N=100 scenarios for all evaluation axes and visits, except 'Sufficiently Precise' (Investigations: N=92, N=82, N=72 for three visits; Treatments: N=72, N=83, N=79 for three visits). P-values from the McNemar test are shown for all comparisons with $p<0.05$ after false discovery rate (FDR) correction.
  • Figure 5: Management Reasoning Empirical Key Features (MXEKF). Relative performance of PCPs and AMIE on each of the 10 MXEKF evaluation axes in terms of preferences expressed by specialist physicians and patient actors respectively. Preferences were derived from independent ratings (on a 5-point scale ranging from 'Poor' to 'Excellent') for each of three visits per scenario. For 3 of the 10 MXEKF evaluation axes (Contrasting and Selection, Illness-Specific Knowledge, Prognostication), ratings were collected only from specialist physicians. Cases with identical ratings and those including one or more N/A ratings were grouped together as 'Tie or N/A'. Error bars represent 95% confidence intervals for binomial proportions.
  • ...and 37 more figures
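
The Figure 2 caption describes output structures that are defined in Python code and converted into a JSON schema that is appended to the prompt. The following is a minimal sketch of that idea using only the standard library. The class names, field names, and the `to_schema` helper are hypothetical illustrations, not the structures or tooling used by the authors, and the sketch produces only the schema; it does not implement inference-time decoding constraints.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from enum import Enum
from typing import get_args, get_origin, get_type_hints


class Urgency(Enum):  # hypothetical closed set of allowed values
    ROUTINE = "routine"
    URGENT = "urgent"


@dataclass
class PlanItem:  # hypothetical plan entry with guideline references
    action: str
    urgency: Urgency
    guideline_refs: list[str]


@dataclass
class ManagementPlan:  # hypothetical top-level target structure
    reasoning: str
    items: list[PlanItem]


def to_schema(tp):
    """Recursively convert a Python type into a JSON-schema fragment."""
    if tp is str:
        return {"type": "string"}
    if get_origin(tp) is list:
        return {"type": "array", "items": to_schema(get_args(tp)[0])}
    if isinstance(tp, type) and issubclass(tp, Enum):
        return {"type": "string", "enum": [member.value for member in tp]}
    if isinstance(tp, type) and is_dataclass(tp):
        hints = get_type_hints(tp)
        names = [f.name for f in fields(tp)]
        return {
            "type": "object",
            "properties": {name: to_schema(hints[name]) for name in names},
            "required": names,
            "additionalProperties": False,
        }
    raise TypeError(f"unsupported type: {tp!r}")


schema = to_schema(ManagementPlan)
# The schema text can be appended to a prompt, as the caption describes.
prompt_suffix = "Respond with JSON matching this schema:\n" + json.dumps(schema, indent=2)
```

Restricting `urgency` to an enum mirrors the caption's "sets of values": a constrained decoder (not shown here) could then only emit one of the listed strings for that field.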
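
The Figure 4 caption describes a paired comparison: ordinal ratings are binarized using the top-2 options, the two study arms are compared with a McNemar test, and p-values are corrected for the false discovery rate. The sketch below illustrates that pipeline under stated assumptions. The caption does not say which McNemar variant or which FDR procedure was used, so the exact binomial form and the Benjamini-Hochberg procedure are assumptions here, and all function names are hypothetical.

```python
from math import comb


def binarize_top2(rating: int, scale_max: int) -> bool:
    """Favorable if the rating is one of the top two options on the scale."""
    return rating >= scale_max - 1


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases where only the first arm was rated favorable.
    c: cases where only the second arm was rated favorable.
    (Assumption: exact binomial variant; the source does not specify.)
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    (Assumption: the source only says 'FDR correction'.)
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Only discordant pairs (scenarios where exactly one arm received a favorable rating) enter the McNemar statistic, which is why the test suits this design: each scenario is rated for both AMIE and a PCP, so the two samples are paired rather than independent.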