Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu

TL;DR

Doctor-R1 addresses the gap between static medical knowledge and dynamic clinical inquiry by introducing an experiential agentic reinforcement learning framework. It unifies strategic multi-turn inquiry and diagnostic decision-making within a single doctor agent, leveraging a dynamic multi-agent environment, a two-tier reward system, and an experience repository to ground learning in high-quality trajectories. Empirical results on HealthBench and MAQuE show state-of-the-art performance for an 8B model, with consistent improvements in communication quality, empathy, and task accuracy, supported by strong human preferences. The work demonstrates that learning from high-quality experiences and grounding policy in interactive simulations can enhance both the safety and effectiveness of AI-driven clinical consultations, while acknowledging ethical considerations and the need for careful deployment guidance.

Abstract

The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI's HealthBench and MAQuE, assessed across multi-faceted metrics such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, human evaluations show a strong preference for the clinical dialogue generated by Doctor-R1, demonstrating the effectiveness of the framework.

Paper Structure

This paper contains 62 sections, 5 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The Gap Between Static Medical Decision Task and Dynamic Clinical Inquiry: Top-tier models demonstrate high performance on static medical decision benchmarks. However, these models adhere to a generic script in a high-risk clinical case, failing to adapt to the complex scenario. Doctor-R1 demonstrates a strategic inquiry process, showing the effectiveness of our framework.
  • Figure 2: The interactive training loop of our Doctor-R1 framework. The process unfolds within a (1) Dynamic Interactive Environment populated by diverse patient simulations. At (2) each turn of an inquiry session, the (3) Doctor Agent interacts with the environment by observing the state, querying the (4) Experience Repository, and selecting an action. A Patient Agent responds, and the (5) Consultation Evaluator evaluates the action based on the two-tiered reward architecture. This new experience is stored in the repository and used to optimize the policy of the Doctor Agent.
  • Figure 3: Results of the pairwise human evaluation across four qualitative metrics. Each bar shows the distribution of wins (blue), ties (yellow), and losses (red) for a model compared against all others.
  • Figure 4: An ablation study comparing the experience retrieval mechanism of Doctor-R1 against baseline methods.
  • Figure 5: Scaling analysis of key framework components. (a) The impact of dialogue turns on task accuracy. (b) The impact of the number of simulated patient agents used in training.
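
The per-turn loop described in the Figure 2 caption (observe the state, query the experience repository, select an action, receive the patient's reply, score the action, store the experience) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: every name here (`Experience`, `ExperienceRepository`, `run_session`, the toy agents) is hypothetical, the word-overlap retrieval is a stand-in for whatever retrieval mechanism the paper uses, the evaluator is an opaque placeholder for the two-tiered reward, and no policy optimization is performed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experience:
    state: str     # dialogue history the doctor agent observed
    action: str    # the doctor agent's utterance for this turn
    reward: float  # score assigned by the evaluator


@dataclass
class ExperienceRepository:
    items: List[Experience] = field(default_factory=list)

    def query(self, state: str, k: int = 2) -> List[Experience]:
        # Toy retrieval (assumption): rank stored experiences by word
        # overlap with the current state, breaking ties by reward.
        words = set(state.lower().split())

        def score(exp: Experience):
            return (len(words & set(exp.state.lower().split())), exp.reward)

        return sorted(self.items, key=score, reverse=True)[:k]

    def store(self, exp: Experience) -> None:
        self.items.append(exp)


def run_session(
    doctor: Callable[[str, List[Experience]], str],
    patient: Callable[[str], str],
    evaluator: Callable[[str, str], float],
    repo: ExperienceRepository,
    opening: str,
    max_turns: int = 3,
) -> List[Experience]:
    """Run one inquiry session and return the collected trajectory."""
    state = f"Patient: {opening}"
    trajectory: List[Experience] = []
    for _ in range(max_turns):
        retrieved = repo.query(state)      # consult prior experiences
        action = doctor(state, retrieved)  # select an action
        reply = patient(action)            # patient agent responds
        reward = evaluator(state, action)  # placeholder for the reward
        exp = Experience(state, action, reward)
        repo.store(exp)                    # store the new experience
        trajectory.append(exp)
        state = f"{state}\nDoctor: {action}\nPatient: {reply}"
    return trajectory


# Toy stand-ins for the three agents (all hypothetical).
def toy_doctor(state: str, retrieved: List[Experience]) -> str:
    return "How long have you had this symptom?"


def toy_patient(action: str) -> str:
    return "About two days."


def toy_evaluator(state: str, action: str) -> float:
    # Rewards any question; a real evaluator would be far richer.
    return 1.0 if action.endswith("?") else 0.0


repo = ExperienceRepository()
traj = run_session(toy_doctor, toy_patient, toy_evaluator, repo,
                   "I have chest pain.", max_turns=2)
```

In a real training setup the collected trajectory would feed a policy-optimization step, and the doctor, patient, and evaluator callables would be backed by language models rather than fixed strings.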