ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Weixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, Haoyuan Chai, Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li Zhu, Xuandong Zhao

TL;DR

ClinicalLab tackles the gap between AI diagnostic capabilities and real-world clinical practice by introducing ClinicalBench, ClinicalMetrics, and ClinicalAgent. ClinicalBench provides a leakage-free, end-to-end benchmark across 24 departments and 150 diseases with eight diagnostic tasks to mirror patient journeys; ClinicalMetrics offers four pragmatic evaluation metrics for department-guided and diagnostic effectiveness, complemented by robust synonym alignment. The experiments reveal department-specific strengths among LLMs and show that a multi-agent ClinicalAgent with department-specialized models yields superior end-to-end performance. By enabling realistic, multi-disciplinary evaluation and guided agent design, this work lays a foundation for safer, more reliable AI-assisted clinical decision-making in real-world hospital settings.

Abstract

LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

Paper Structure

This paper contains 33 sections, 5 equations, 7 figures, 29 tables, and 1 algorithm.

Figures (7)

  • Figure 1: The workflow diagram of ClinicalAgent. ClinicalAgent covers the entire process starting from the moment a patient enters the clinic and ending when the patient is discharged, which includes six key steps: 1) department guide; 2) preliminary consultation; 3) laboratory examination; 4) imageological examination; 5) final consultation; 6) medical treatment.
  • Figure 2: The data management pipeline for ClinicalBench.
  • Figure 3: Departments and distribution of case samples in ClinicalBench.
  • Figure 4: Ranking of different LLMs across departments, with the x-axis showing department abbreviations (as defined in Figure 3) and the y-axis showing the models. Each department's ranking is determined by the Avg. Score metric described in the table labeled table:bench.
  • Figure 5: Performance trends of ClinicalAgent and LLMs using different evaluation metrics.
  • ...and 2 more figures
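
The per-department ranking described for Figure 4 can be illustrated with a minimal sketch. It assumes each model has a single Avg. Score per department and that a higher score ranks first. The model names, department names, and score values below are hypothetical placeholders, not results from the paper, and the alphabetical tie-break is an assumption, since the summary does not say how ties are handled.

```python
from typing import Dict, List

# Hypothetical Avg. Score values, invented purely for illustration.
# They are NOT results reported in the paper.
avg_scores: Dict[str, Dict[str, float]] = {
    "model_a": {"dept_x": 0.62, "dept_y": 0.48},
    "model_b": {"dept_x": 0.55, "dept_y": 0.59},
    "model_c": {"dept_x": 0.70, "dept_y": 0.41},
}


def rank_models_by_department(
    scores: Dict[str, Dict[str, float]],
) -> Dict[str, List[str]]:
    """Return {department: [models ordered from best to worst]}."""
    # Collect every department that appears for at least one model.
    departments = sorted({d for per_model in scores.values() for d in per_model})
    rankings: Dict[str, List[str]] = {}
    for dept in departments:
        # Only rank models that actually have a score for this department.
        candidates = [m for m in scores if dept in scores[m]]
        # Higher score first; ties broken alphabetically (an assumption).
        rankings[dept] = sorted(candidates, key=lambda m: (-scores[m][dept], m))
    return rankings


rankings = rank_models_by_department(avg_scores)
for dept, ordered in rankings.items():
    print(dept, ordered)
```

With these placeholder values, model_c ranks first in dept_x and last in dept_y. This is the kind of department-to-department variation that motivates assigning department-specialized models in ClinicalAgent.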