The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability

Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu

TL;DR

MAQuE (Medical Agent Questioning Evaluation) is the largest benchmark to date for automatic assessment of multi-turn medical inquiry. It combines 3,000 diverse simulated patient agents with a five-dimensional evaluation scheme and AIU-based analysis to examine how AI doctors gather information, conduct dialogue, and attend to patient experience. Across a wide range of LLMs, the results reveal substantial gaps in proactive inquiry, empathy, and robustness to patient behavior, and they expose trade-offs between diagnostic accuracy and inquiry quality. The study provides a foundation for designing better incentive structures and training environments for AI doctors capable of humane, information-rich, and efficient clinical consultations.

Abstract

An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.

Paper Structure

This paper contains 45 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Comparison between MAQuE and existing benchmarks. MAQuE enables more realistic patient simulation by integrating diverse behaviors and evaluates doctor inquiries from more comprehensive and fine-grained perspectives.
  • Figure 2: Pipeline for constructing patient profiles with simulated human-like behaviors.
  • Figure 3: Evaluation results of LLMs' inquiry capabilities across the fixed inquiry rounds.
  • Figure 4: Comparison of LLMs' inquiry and diagnosis capabilities, with diagnostic performance evaluated based on the interaction history generated by GPT-4o as the inquiry model with our simulated patient.
  • Figure 5: Distribution of data instances across different medical departments in the dataset.
  • ...and 2 more figures