CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

Zetian Ouyang, Yishuai Qiu, Linlin Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He

TL;DR

CliMedBench presents a large-scale, real-world Chinese clinical benchmark for evaluating medical LLMs across 14 core scenarios and seven evaluation axes, totaling 33,735 questions derived from de-identified EHRs and exam content. The authors develop a taxonomy-guided dataset with authentic clinical data and introduce an agent-based Computerized Adaptive Testing framework rooted in the IRT-3PL model to enable efficient, scalable model evaluation. Across 11 LLMs, the study finds that Chinese medical LLMs lag in reasoning and factual consistency compared with strong general-domain models, though some Chinese models approach GPT-4 in certain clinical dimensions; results also reveal brittleness to input length, perturbations, and limited multimodal capability. The work delivers practical insights into model weaknesses, introduces a scalable evaluation method, and highlights data authenticity and ethical considerations, with potential impact on clinical AI research and benchmark design through improved evaluation rigor and efficiency.
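The summary above names the IRT-3PL model but does not spell it out. As a rough illustration only, the standard three-parameter logistic form gives the probability that a test taker of ability theta answers an item correctly, given the item's discrimination a, difficulty b, and guessing floor c. The sketch below uses that textbook form, plus a maximum-Fisher-information rule for choosing the next question, which is a common choice in computerized adaptive testing. The function names, the parameter names, and the selection rule are assumptions for illustration; the paper's agent-based procedure may differ.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Textbook 3PL: chance of a correct answer at ability theta.

    a = discrimination, b = difficulty, c = guessing floor (assumed names).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def fisher_information_3pl(theta, a, b, c):
    """Fisher information of one 3PL item at ability theta."""
    p = p_correct_3pl(theta, a, b, c)
    q = 1.0 - p
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (q / p)


def pick_next_item(theta, items, asked):
    """Index of the not-yet-asked item that is most informative at theta.

    items is a list of (a, b, c) tuples; asked is a set of used indices.
    This greedy rule is an illustrative assumption, not the paper's method.
    """
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(
        candidates,
        key=lambda i: fisher_information_3pl(theta, *items[i]),
    )
```

Under this rule an adaptive test asks the question whose difficulty sits closest to the current ability estimate, which is why it can reach a stable estimate with far fewer questions than a full pass over the benchmark.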

Abstract

With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models must be examined thoroughly. We present CliMedBench, a comprehensive benchmark with 14 expert-guided core clinical scenarios specifically designed to assess the medical ability of LLMs across 7 pivotal dimensions. It comprises 33,735 questions derived from real-world medical reports of top-tier tertiary hospitals and authentic examination exercises. The reliability of this benchmark has been confirmed in several ways. Subsequent experiments with existing LLMs have led to the following findings: (i) Chinese medical LLMs underperform on this benchmark, especially where medical reasoning and factual consistency are vital, underscoring the need for advances in clinical knowledge and diagnostic accuracy. (ii) Several general-domain LLMs demonstrate substantial potential in clinical practice, while the limited input capacity of many medical LLMs hinders their practical use. These findings reveal both the strengths and limitations of LLMs in clinical scenarios and offer critical insights for medical research.

Paper Structure

This paper contains 23 sections, 1 equation, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of CliMedBench with "Who-What-How" taxonomy linking users with core clinical scenarios.
  • Figure 2: Workflow of collaboration between humans and LLMs for dataset construction.
  • Figure 3: Data distribution of clinical scenarios.
  • Figure 4: Human evaluation results of four aspects.
  • Figure 5: Accuracy comparison of four models on seven datasets using both vanilla and CoT prompts.
  • ...and 6 more figures