QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang, Yuxuan Jiang, Tingting Yu, Yunqing Hong, Jiayi Liu, Rianzhe Huang, Shuxin Zhao, Haiping Hu, Wen Shang, Jian Xu, Guanjun Jiang

Abstract

While Large Language Models (LLMs) excel on standardized medical exams, high exam scores often fail to translate into high-quality responses to real-world medical queries. Current evaluations rely heavily on multiple-choice questions and therefore fail to capture the unstructured, ambiguous, and long-tail complexities of genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints quantify medical accuracy, key-point coverage, and risk interception, mitigating the high cost and subjectivity of human grading. Experimental results show that the generated rubrics achieve a 91.8% concordance rate with blind audits by clinical experts, supporting their medical reliability. Baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. QuarkMedBench thus provides a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, and its framework supports dynamic knowledge updates to prevent benchmark obsolescence.

Paper Structure (63 sections, 7 equations, 4 figures, 14 tables)

Figures (4)

  • Figure 1: Performance of mainstream LLMs on QuarkMedBench. This grouped bar chart compares the scores of 14 models without length constraints (hatched bars) versus those with a strict length constraint of $\le$1000 words (solid bars).
  • Figure 2: Automated Scoring Rubric (Auto-Rubrics) Generation Pipeline
  • Figure 3: An empirical example of the synthesized Ground Truth. The JSON schema demonstrates the pipeline's capability to decompose complex medical concepts into three granular dimensions: essential facts (imp), deep clinical insights (aha), and extended knowledge (ext), while successfully capturing discrepancies across international clinical guidelines.
  • Figure 4: Task classification distribution comparison between the QuarkMedBench and HealthBench datasets. The left panel (1) shows the distribution across six main categories: Basic Inquiry, Pre-consultation, During-treatment, Post-treatment, Professional Medical, and Special Cases. The right panel (2) displays the top 15 specific task labels ranked by combined frequency. Percentages are calculated relative to the total number of samples in each dataset. QuarkMedBench demonstrates a balanced distribution across foundational categories including Basic Inquiry (30.0% vs 15.8%) and Pre-consultation (18.0% vs 12.1%), aligning with the full medical consultation lifecycle.
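
The abstract describes scoring with hierarchical weighting and safety constraints, and Figure 3 names three rubric dimensions (imp, aha, ext). A minimal sketch of how such a score could be computed is given below. It is an illustration only, not the paper's formula: the tier weights, the `RubricItem` type, the `score_response` function, and the rule of capping the score when a safety violation is flagged are all assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical tier weights; the source does not state numeric values.
TIER_WEIGHTS = {"imp": 3.0, "aha": 2.0, "ext": 1.0}


@dataclass
class RubricItem:
    tier: str        # one of "imp", "aha", "ext" (dimension names from Figure 3)
    satisfied: bool  # whether a judge marked this point as covered by the response


def score_response(items, safety_violation=False, safety_cap=0.0):
    """Return weighted rubric coverage in [0, 1].

    Assumed rule: if a safety violation is flagged, the score is capped
    at `safety_cap` regardless of how many rubric items are satisfied.
    """
    total = sum(TIER_WEIGHTS[item.tier] for item in items)
    if total == 0:
        return 0.0
    earned = sum(TIER_WEIGHTS[item.tier] for item in items if item.satisfied)
    score = earned / total
    return min(score, safety_cap) if safety_violation else score


if __name__ == "__main__":
    rubric = [
        RubricItem("imp", True),
        RubricItem("aha", False),
        RubricItem("ext", True),
    ]
    print(score_response(rubric))
    print(score_response(rubric, safety_violation=True))
```

Under these assumed weights, essential facts dominate the score, and a flagged safety issue overrides coverage entirely. The paper's actual weighting and constraint rules may differ.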