MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, Yixuan Yuan

TL;DR

This work tackles the absence of inpatient-specific AI benchmarks and large-scale datasets by introducing IPDS, a MIMIC-IV-derived benchmark of 51,274 cases covering 9 departments, 17 disease categories, and 16 treatment options. The authors propose MAP, a multi-agent framework in which triage, diagnosis, and treatment agents are guided by a chief agent and supported by a record-review module, a trainable retrieval-enhanced generation (REG) component, and an expert-guidance mechanism that enforces diagnostic rigor. On IPDS, MAP reaches 78.10% diagnostic accuracy, a gain of 25.10 percentage points over HuatuoGPT2-13B, and 10–12% higher clinical compliance than three board-certified clinicians, indicating strong potential for real-world inpatient pathway support. The results highlight the importance of comprehensive data integration (medical history, radiology, demographics) and structured, explainable reasoning in AI-assisted inpatient decision-making, with implications for deployment and future benchmarking in hospital settings.

Abstract

Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advances in large language models (LLMs) for medical applications, little research has focused on artificial intelligence (AI) systems for inpatient pathways, owing to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks typically concentrate on medical question answering and examinations, ignoring the multifaceted nature of clinical decision-making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories, alongside 16 standardized treatment options. Then, we proposed the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents: a triage agent managing patient admission, a diagnosis agent serving as the primary decision-maker in the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent that oversees the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed that MAP improved diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B. Notably, MAP demonstrated significant clinical compliance, outperforming three board-certified clinicians by 10%-12% and establishing a foundation for inpatient pathway systems.
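
The agent flow described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: each agent is modeled as a plain callable from text to text, the chief agent is assumed to review every step, and all names (`PatientRecord`, `run_pathway`, the `"respiratory"` department, the stub outputs) are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: an "agent" is any callable mapping a prompt string to a reply string.
# In practice this would wrap an LLM call; here it is left abstract.
Agent = Callable[[str], str]


@dataclass
class PatientRecord:
    """Hypothetical container for the inputs named in the summary."""
    demographics: str
    radiology_report: str
    medical_history: str

    def summary(self) -> str:
        return (
            f"Demographics: {self.demographics}\n"
            f"Radiology: {self.radiology_report}\n"
            f"History: {self.medical_history}"
        )


@dataclass
class PathwayResult:
    department: str
    diagnosis: str
    treatment: str
    chief_notes: List[str] = field(default_factory=list)


def run_pathway(
    record: PatientRecord,
    triage_agent: Agent,
    diagnosis_agents: Dict[str, Agent],  # one diagnosis agent per department
    treatment_agent: Agent,
    chief_agent: Agent,
) -> PathwayResult:
    """Triage -> diagnosis -> treatment, with the chief agent reviewing each step."""
    summary = record.summary()
    notes: List[str] = []

    department = triage_agent(summary)
    notes.append(chief_agent(f"Triage -> {department}"))

    diagnosis = diagnosis_agents[department](summary)
    notes.append(chief_agent(f"Diagnosis -> {diagnosis}"))

    treatment = treatment_agent(f"{summary}\nDiagnosis: {diagnosis}")
    notes.append(chief_agent(f"Treatment -> {treatment}"))

    return PathwayResult(department, diagnosis, treatment, notes)


# Usage with stub agents (all values are made-up placeholders, not IPDS data).
record = PatientRecord(
    demographics="placeholder demographics",
    radiology_report="placeholder radiology report",
    medical_history="placeholder medical history",
)
result = run_pathway(
    record,
    triage_agent=lambda text: "respiratory",
    diagnosis_agents={"respiratory": lambda text: "D-example"},
    treatment_agent=lambda text: "T-example",
    chief_agent=lambda text: f"reviewed: {text}",
)
print(result)
```

Keeping agents as interchangeable callables makes it easy to swap a stub for a real model when experimenting; how the chief agent actually intervenes in MAP is not specified in this summary.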

Paper Structure

This paper contains 22 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Illustration of the Inpatient Pathway Decision Support (IPDS) benchmark. (a) The statistics, processing, and data sources of the IPDS benchmark. IPDS contains 51,274 cases across 9 departments, 17 diseases (D1-D17), and 16 treatments (T1-T16), and provides a comprehensive evaluation of LLMs in different inpatient scenarios. The abbreviations and details of the diagnosis options are given in Table \ref{tab:dise_table}. (b) The evaluation of the IPDS benchmark. (c) The Sankey diagram of the IPDS benchmark, which visualizes the data distribution across the workflow of different inpatient scenarios. The abbreviations and details of the department, disease, and treatment options are provided in the supplementary materials.
  • Figure 2: Overview of the Multi-Agent Inpatient Pathways (MAP) framework. MAP is a multi-agent collaborative framework that simulates the inpatient pathway flow. The framework consists of LLM-empowered agents: a triage agent for department triage, a diagnosis agent in each department acting as the clinical decision-maker, a treatment agent for the treatment plan, and a chief agent overseeing the diagnosis and treatment pathways. Three key components support the MAP framework: (1) a record review module that analyzes patient data, including demographic information, radiological reports, and medical history; (2) a trainable retrieval-enhanced generation (REG) module that integrates clinical knowledge bases with chain-of-thought reasoning to support reliable diagnostic decision-making; and (3) an expert guidance module that ensures diagnostic rigor through structured supervision of the diagnosis agent.
  • Figure 3: MAP demonstrated enhanced capabilities in supporting inpatient pathways compared to state-of-the-art LLMs. Current general LLMs showed limited performance in diagnostic support, with accuracy below 53.00%, while specialized medical models such as Clinical-Camel-70B achieved 47.50% accuracy. MAP achieved an overall diagnostic support accuracy of 78.10%, an improvement of 30.60, 27.20, and 25.10 percentage points over Clinical-Camel-70B, Meditron-70B, and HuatuoGPT2-13B, respectively ($p<0.001$ for all comparisons). The best and second-best performance values are marked for clarity. These results demonstrated the potential of MAP as a clinical decision-support tool across diverse inpatient scenarios.
  • Figure 4: State-of-the-art LLMs showed unsatisfactory diagnostic support for complex inpatient clinical cases, whereas MAP improved it. (a) For instance, HuatuoGPT2-13B reached an accuracy of only 38.46% on D5 (mental and behavioral disorders), while MAP achieved 79.86% on D5. (b) This improvement was attributed to the integration of the proposed record review, trainable REG, and expert guidance modules, as verified by the corresponding ablation studies.
  • Figure 5: Evaluation of LLM performance in supporting inpatient pathways on the IPDS benchmark. (1) MAP demonstrated consistent diagnostic support capabilities with minimal performance variance, achieving accuracy 10%-12% higher than board-certified clinicians ($p = 0.0067$). (2) Statistical analysis revealed strong agreement between MAP and the ground truth ($\textup{ICC} = 0.81$), exceeding the agreement between individual clinicians and the ground truth ($\textup{ICC}\in[0.67,0.68]$). (3) MAP maintained strong alignment with clinicians ($\textup{ICC}\in[0.75,0.84]$) while providing consistent diagnostic support.
  • ...and 7 more figures
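
The accuracy figures quoted in the Figure 3 caption can be cross-checked with a short calculation. The sketch below assumes the reported gains are absolute differences in percentage points, which is consistent with the 47.50% stated for Clinical-Camel-70B; the relative-gain column is added only for illustration and does not appear in the source. The helper name `implied_baseline` is hypothetical.

```python
from decimal import Decimal

# Values quoted in the Figure 3 caption (percent accuracy and reported gains).
MAP_ACCURACY = Decimal("78.10")
REPORTED_GAINS = {
    "Clinical-Camel-70B": Decimal("30.60"),
    "Meditron-70B": Decimal("27.20"),
    "HuatuoGPT2-13B": Decimal("25.10"),
}


def implied_baseline(map_accuracy: Decimal, gain: Decimal) -> Decimal:
    """Baseline accuracy implied by an absolute (percentage-point) gain."""
    return map_accuracy - gain


baselines = {
    model: implied_baseline(MAP_ACCURACY, gain)
    for model, gain in REPORTED_GAINS.items()
}

for model, baseline in baselines.items():
    # Relative gain = absolute gain divided by the baseline accuracy.
    relative = REPORTED_GAINS[model] / baseline * 100
    print(f"{model}: baseline {baseline}%, relative gain {relative:.1f}%")
```

Under this reading, the implied baselines are 47.50%, 50.90%, and 53.00%, so the "25.10%" improvement over HuatuoGPT2-13B is a difference in percentage points rather than a relative increase.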