AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Zhihao Fan; Jialong Tang; Wei Chen; Siyuan Wang; Zhongyu Wei; Jun Xi; Fei Huang; Jingren Zhou

TL;DR

AI Hospital addresses the gap between static medical benchmarks and real-world clinical practice by simulating dynamic doctor-patient interactions with NPCs and a Doctor agent. The MVME benchmark evaluates LLM-driven doctors on symptom elicitation, exam planning, and diagnostic reasoning using Chinese medical records, while a dispute-resolution collaboration mechanism aims to improve diagnostic accuracy. Experiments reveal a substantial gap between interactive LLMs and the one-step GPT-4 upper bound, though multi-agent collaboration and structured dispute resolution provide meaningful gains. The framework and open-source data/code offer a scalable platform for advancing AI-assisted clinical decision support and medical education.

Abstract

Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework simulating dynamic medical interactions between a Doctor as the player and NPCs including a Patient, an Examiner, and a Chief Physician. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

Paper Structure (31 sections, 6 figures, 15 tables, 1 algorithm)

Introduction
Setup of AI Hospital
Agents Setup with Medical Records
Agent Behavior Setting for NPCs
Agent Behavior Setting for Player
Dialogue Flow in AI Hospital
MVME: Evaluation of LLMs as Intern Doctors for Clinical Diagnosis
Multi-View Evaluation Criteria
MVME Dataset Construction
Collaborative Diagnosis of LLMs Focused on Dispute Resolution
Experiments
Agent Behavior Analysis in AI Hospital Framework
Can LLMs Diagnose Like Doctors?
Further Analysis
Collaboration Mechanism
...and 16 more sections

Figures (6)

Figure 1: The demonstration of AI Hospital framework.
Figure 2: Statistical analysis of discussion rounds in collaborative frameworks with and without "Dispute Resolution" mechanism.
Figure 3: An example of dialogue flow among Doctor, Patient, Examiner and Chief Physician in AI Hospital framework.
Figure 4: Collaboration of Doctors for clinical diagnosis.
Figure 5: Linear regression analysis among symptoms, medical examinations and diagnostic results, diagnostic rationales, and treatment plan.
...and 1 more figures
