ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

Weixiang Yan; Haitian Liu; Tengxiao Wu; Qian Chen; Wen Wang; Haoyuan Chai; Jiayi Wang; Weishan Zhao; Yixin Zhang; Renjun Zhang; Li Zhu; Xuandong Zhao

TL;DR

ClinicalLab tackles the gap between AI diagnostic capabilities and real-world clinical practice by introducing ClinicalBench, ClinicalMetrics, and ClinicalAgent. ClinicalBench provides a leakage-free, end-to-end benchmark across 24 departments and 150 diseases with eight diagnostic tasks to mirror patient journeys; ClinicalMetrics offers four pragmatic evaluation metrics for department-guided and diagnostic effectiveness, complemented by robust synonym alignment. The experiments reveal department-specific strengths among LLMs and show that a multi-agent ClinicalAgent with department-specialized models yields superior end-to-end performance. By enabling realistic, multi-disciplinary evaluation and guided agent design, this work lays a foundation for safer, more reliable AI-assisted clinical decision-making in real-world hospital settings.

Abstract

LLMs have made significant progress across a range of NLP applications. However, they still struggle to meet the strict accuracy and reliability requirements of the medical field and face many challenges in clinical applications. Existing benchmarks for evaluating LLM-powered medical agents on clinical diagnosis have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the multi-departmental, specialized nature of modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluation of end-to-end real clinical scenarios. These benchmark limitations in turn obstruct the advancement of LLMs and agents for medicine. To address them, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across departments. Based on these findings, we propose ClinicalAgent, an end-to-end clinical agent within ClinicalLab that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of ClinicalAgent variants on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices when designing medical agents.

Paper Structure (33 sections, 5 equations, 7 figures, 29 tables, 1 algorithm)

Introduction
Related Work
Existing Medical Benchmarks
Existing Agents for Medical Applications
ClinicalBench: An End-to-End, Real-Case-based, Data-Leakage-Free Benchmark for Multi-Department Clinical Diagnostic Evaluation
Data Sources & Licenses
Data Processing & Quality
Data Statistics
Task Overview
Department Guide (Multi-Choice QA with 24 options)
Clinical Diagnosis (Generative QA)
Imaging Diagnosis (Generative QA)
Experiments of LLMs on ClinicalBench
Models
Evaluation Metrics (ClinicalMetrics)
...and 18 more sections

Figures (7)

Figure 1: The workflow diagram of ClinicalAgent. ClinicalAgent covers the entire process starting from the moment a patient enters the clinic and ending when the patient is discharged, which includes six key steps: 1) department guide; 2) preliminary consultation; 3) laboratory examination; 4) imageological examination; 5) final consultation; 6) medical treatment.
Figure 2: The data management pipeline for ClinicalBench.
Figure 3: Departments and distribution of case samples in ClinicalBench.
Figure 4: Ranking of different LLMs across departments, with the x-axis representing department abbreviations (abbreviations correspond to Figure \ref{fig:stat_department}) and the y-axis representing the models. The ranking of each department is determined based on the Avg. Score metric described in Table \ref{table:bench}.
Figure 5: Performance trends of ClinicalAgent and LLMs using different evaluation metrics.
...and 2 more figures
