ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao

TL;DR

A dual-judge evaluation framework is introduced, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment.

Abstract

Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2,500 open-ended cases spanning the full continuum of care, from prevention and intervention to long-term follow-up, and covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.

Paper Structure

This paper contains 50 sections, 7 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: ClinConsensus end-to-end pipeline. Clinical experts curate and de-identify real-world cases and annotate care stages, task-specialty coverage, and difficulty levels. Each case comprises 30 expert-crafted, case-specific rubrics, classified into consensus-level and domain-specific types. Quality control proceeds in two stages: (1) experts manually score rubric performance using outputs from three leading LLMs, and any case scoring above 50% is discarded to ensure difficulty; (2) a senior reviewer audits case content, rubrics, and reference answers. The final benchmark evaluates tested models under a dual-judge framework (LLM-as-judge and a locally deployable SFT-trained judge), reporting CACS@k based on rubric-hit scores and measuring judge-physician consistency.
  • Figure 2: Specialty coverage and specialty-wise performance on ClinConsensus (2,500 cases, 36 specialties). (a) Sunburst of case distribution by specialty (outer ring) grouped into five super-categories (inner ring); percentages denote the share of cases. (b) Specialty-wise CACS@7 of gpt-5.2. Each outer wedge corresponds to one specialty (equal angular width); numbers denote CACS@7 scores and color intensity encodes relative score magnitude within the model (darker indicates higher).
  • Figure 3: LLM-as-Judge input-output contract. The judge takes the full conversation context and one rubric item as input, and evaluates only the last assistant turn, outputting a schema-constrained JSON decision (explanation, criteria_met).
  • Figure 4: Task distribution and theme-wise model performance on ClinConsensus. (a) Distribution of all 2,500 cases across 12 task themes. The inner ring shows five macro themes (Condition Understanding, Disease Diagnosis, Treatment & Medication, Daily Patient Q&A, and Follow-up & Monitoring), and the outer ring shows fine-grained task categories; wedge size indicates the proportion of cases. (b) Theme-wise CACS@7 (%) for 15 evaluated models across the same task themes. Models are ordered from left to right by overall CACS@7 (high to low). Cell values report exact scores, and color intensity is normalized within each macro-theme group.
  • Figure 5: Model-physician agreement by task theme. Each point denotes the MF1 agreement between an automated grader and physician labels for one evaluated model within a theme; stars indicate the mean across evaluated models. Across themes, both LLM-as-judge graders and the distilled SFT judge exhibit consistently high agreement with physicians.
  • ...and 6 more figures
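
The judge contract described in the Figure 3 caption (full conversation plus one rubric item in, a JSON decision with explanation and criteria_met out) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names JudgeDecision, build_judge_input, and parse_judge_output are hypothetical, the role/content message layout is an assumed convention, and treating criteria_met as a boolean and explanation as a string is an assumption the caption does not confirm.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeDecision:
    """One judge verdict for a single rubric item (field types are assumed)."""
    explanation: str
    criteria_met: bool


def build_judge_input(conversation: list[dict], rubric_item: str) -> dict:
    """Bundle the full conversation and one rubric item for the judge.

    Assumes each turn is a dict with "role" and "content" keys. Only the
    last assistant turn is graded, so the conversation must end with one.
    """
    if not conversation or conversation[-1].get("role") != "assistant":
        raise ValueError("conversation must end with an assistant turn")
    return {"conversation": conversation, "rubric_item": rubric_item}


def parse_judge_output(raw: str) -> JudgeDecision:
    """Parse a raw judge response and enforce the two-field schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    expected = {"explanation", "criteria_met"}
    if set(data) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    if not isinstance(data["explanation"], str):
        raise ValueError("explanation must be a string")
    if not isinstance(data["criteria_met"], bool):
        raise ValueError("criteria_met must be a boolean")
    return JudgeDecision(data["explanation"], data["criteria_met"])
```

Rejecting any response that does not match the schema exactly, rather than coercing values such as "yes" into a boolean, keeps malformed judge outputs from silently counting as rubric hits.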