AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, Jingren Zhou

TL;DR

AI Hospital addresses the gap between static medical benchmarks and real-world clinical practice by simulating dynamic doctor-patient interactions with NPCs and a Doctor agent. The MVME benchmark evaluates LLM-driven doctors on symptom elicitation, exam planning, and diagnostic reasoning using Chinese medical records, while a dispute-resolution collaboration mechanism aims to improve diagnostic accuracy. Experiments reveal a substantial gap between interactive LLMs and the one-step GPT-4 upper bound, though multi-agent collaboration and structured dispute resolution provide meaningful gains. The framework and open-source data/code offer a scalable platform for advancing AI-assisted clinical decision support and medical education.

Abstract

Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework simulating dynamic medical interactions between the Doctor as player and NPCs including the Patient, Examiner, and Chief Physician. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

Paper Structure (31 sections, 6 figures, 15 tables, 1 algorithm)

This paper contains 31 sections, 6 figures, 15 tables, 1 algorithm.

Figures (6)

  • Figure 1: The demonstration of AI Hospital framework.
  • Figure 2: Statistical analysis of discussion rounds in collaborative frameworks with and without "Dispute Resolution" mechanism.
  • Figure 3: An example of dialogue flow among Doctor, Patient, Examiner and Chief Physician in AI Hospital framework.
  • Figure 4: Collaboration of Doctors for clinical diagnosis.
  • Figure 5: Linear regression analysis among symptoms, medical examinations and diagnostic results, diagnostic rationales, and treatment plan.
  • ...and 1 more figure