LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun
TL;DR
LiveMedBench is a contamination-free, weekly-updated medical benchmark that addresses data leakage and knowledge obsolescence in medical AI evaluation. It combines a Multi-Agent Clinical Curation Framework with an Automated Rubric-based Evaluation Framework. By harvesting real-world cases from multilingual online medical communities and validating them against evidence-based principles, it provides 2,756 cases across 38 specialties and 16,702 case-specific rubrics. Evaluations of 38 LLMs show that 84% of models degrade on post-cutoff cases, underscoring the dangers of static benchmarks and data contamination, while retrieval-augmented knowledge injection mitigates these losses. The framework offers finer-grained, physician-aligned assessment and highlights the dominant role of contextual application and safety in clinical AI performance, with broad implications for developing and validating clinical decision-support systems.
Abstract
The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static and suffer from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora and inflate performance estimates; and (2) temporal misalignment, where fixed test sets fail to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both of which are inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that harvests real-world clinical cases from online medical communities every week, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters noise from the raw data and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
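
To make the two measurements above concrete, the following is a minimal sketch of how a rubric score and a pre- versus post-cutoff comparison might be computed. It is an illustration under stated assumptions, not the LiveMedBench implementation: the `Criterion` and `Case` types, the per-criterion `weight`, the weighted-fraction scoring rule, and the function names are all hypothetical, and the abstract does not specify how criteria are weighted or aggregated.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Optional


@dataclass
class Criterion:
    """One case-specific rubric item (hypothetical structure)."""
    description: str
    weight: float      # assumed importance of the criterion
    satisfied: bool    # whether the model response met it


@dataclass
class Case:
    """A clinical case with its publication date and graded rubric."""
    published: date
    criteria: list[Criterion]


def rubric_score(case: Case) -> float:
    """Weighted fraction of satisfied criteria (assumed scoring rule)."""
    total = sum(c.weight for c in case.criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in case.criteria if c.satisfied)
    return earned / total


def split_by_cutoff(
    cases: list[Case], cutoff: date
) -> tuple[Optional[float], Optional[float]]:
    """Mean score on cases published up to the cutoff and after it."""
    pre = [rubric_score(c) for c in cases if c.published <= cutoff]
    post = [rubric_score(c) for c in cases if c.published > cutoff]
    return (mean(pre) if pre else None, mean(post) if post else None)
```

Under these assumptions, a model whose mean score on cases published after its training cutoff is lower than its mean on earlier cases would count among the models that exhibit post-cutoff degradation.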
