LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

TL;DR

LiveMedBench introduces a contamination-free, weekly-updated medical benchmark that combines a Multi-Agent Clinical Curation Framework with an Automated Rubric-based Evaluation Framework to address data leakage and knowledge obsolescence in medical AI evaluation. By harvesting real-world cases from multilingual online medical communities and validating them against evidence-based principles, it provides 2,756 cases across 38 specialties and 16,702 case-specific rubrics. Evaluations on 38 LLMs show notable post-cutoff performance drops, underscoring the dangers of static benchmarks and data contamination, while retrieval-augmented knowledge injection mitigates these losses. The framework offers finer-grained, physician-aligned assessment and highlights the dominant role of contextual application and safety in clinical AI performance, with broad implications for developing and validating clinical decision-support systems.

Abstract

The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static and suffer from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora and inflate performance estimates; and (2) temporal misalignment, where benchmarks fail to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both of which are inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that harvests real-world clinical cases from online medical communities every week, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
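The two ideas in the abstract, scoring a response against case-specific criteria and comparing performance before and after a model's knowledge cutoff, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Case` record, the function names, and in particular the unweighted "fraction of criteria met" score are hypothetical, and the paper's actual rubric weighting and grading procedure are not specified in this excerpt.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class Case:
    """Hypothetical record: one clinical case with graded rubric criteria."""
    case_id: str
    posted: date               # date the case appeared online
    criteria_met: list[bool]   # one flag per case-specific criterion


def rubric_score(case: Case) -> float:
    """Fraction of criteria satisfied (assumed unweighted, for illustration)."""
    return sum(case.criteria_met) / len(case.criteria_met)


def split_by_cutoff(cases: list[Case], cutoff: date) -> tuple[list[Case], list[Case]]:
    """Separate cases posted on or before the model's cutoff from later ones."""
    pre = [c for c in cases if c.posted <= cutoff]
    post = [c for c in cases if c.posted > cutoff]
    return pre, post


def mean_score(cases: list[Case]) -> float | None:
    """Average rubric score over a set of cases, or None if the set is empty."""
    return mean(rubric_score(c) for c in cases) if cases else None
```

Comparing `mean_score` on the full set with `mean_score` on the post-cutoff subset mirrors the kind of comparison the abstract describes; a lower post-cutoff value would be consistent with contamination inflating scores on older cases.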

Paper Structure

This paper contains 69 sections, 12 equations, 18 figures, and 11 tables.

Figures (18)

  • Figure 1: (A) Temporal degradation. Model performance consistently declines on clinical cases that post-date their training knowledge cutoffs, highlighting the risk of data contamination. (B) Evaluation alignment. Our proposed Automated Rubric-based Evaluation Framework aligns better with physician experts than LLM-as-a-Judge does.
  • Figure 2: Overview of the LiveMedBench framework. The pipeline consists of five phases: (a) Continuous mining of bilingual clinical data from verified online communities; (b) A Multi-Agent Curation Framework (Screener, Validator, Controller) that structures and validates data against medical guidelines; (c) Automated generation of case-specific evaluation rubrics; (d) Objective evaluation of LLMs using the generated rubrics; and (e) Rigorous human quality assurance to ensure clinical alignment.
  • Figure 3: Data statistics of LiveMedBench. The figure illustrates the comprehensive distribution of (a) 38 clinical specialties, (b) five behavioral themes, (c) data sources and languages, (d) evaluation axes, and (e) the number of grading criteria per case (Mean=6.06).
  • Figure 4: The evaluation results of 38 LLMs on LiveMedBench, categorized into proprietary (green) and open-source (yellow) models. Models marked with a cross (+) are specialized medical models, while others are general-purpose. Solid bars represent performance on the full dataset, while hatched bars indicate performance on cases post-dating the model's knowledge cutoff.
  • Figure 5: Multi-dimensional performance analysis. (Left) Heatmap illustrating the score distribution of representative models across 38 clinical specialties. (Right) Radar chart depicting the capability profiles of representative models across the five behavioral themes.
  • ...and 13 more figures