The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

Alvin Rajkomar, Pavan Sudarshan, Angela Lai, Lily Peng

Abstract

Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the "patient" or "query" populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use.

Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.

Results: We identified a structural "validity gap." While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs.

Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling, analogous to clinical trial reporting, to align evaluation with the full complexity of clinical practice.

Paper Structure (53 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: CONSORT diagram showing study flow from initial assessment through tagging methods to final analyzed datasets.
  • Figure 2: The Evolution of the Validity Gap. Each generation of benchmarks advances structural capabilities but retains specific validity gaps. Generation 1 evaluates fact retrieval but lacks acuity for triage (1.2% triage intent, <2% high-risk). Generation 2 introduces clinical reasoning with rich objective data but relies on synthesized, static narratives (100% single-turn) where data is pre-packaged rather than elicited. Generation 3 achieves interactivity but suffers from sparse clinical content: <0.7% behavioral crisis, 5.5% chronic care, and clinical data density drops compared to Generation 2. Critically, demographic bias persists across all generations: pediatric and geriatric populations represent <11% of queries, validating models on a "standard adult default."
  • Figure 3: Distribution of health topics and query intents across three generations of benchmarks. (A) Health topics: Generation 1 (Search Era) queries span common symptoms and general health concerns. Generation 2 (Case Presentation Era) concentrates in internal medicine subspecialties with higher clinical complexity. Generation 3 (Interactive & Data Era) shows bimodal distribution: HealthBench Main covers diverse acute and chronic conditions, while GoogleFitbit datasets focus predominantly on wellness, sleep, and fitness tracking. (B) Query intents: Generation 1 benchmarks are dominated by queries to learn about health topics (92%), while Generation 3 benchmarks capture broader intent diversity including wellness and lifestyle guidance, medical research queries, and prevention and screening.
  • Figure 4: Distribution comparison between GPT-5.2 and Opus-4.5 across key dimensions. Bars show percentage of queries assigned to each category.
  • Figure 5: Confusion matrices showing agreement patterns for intent, risk sensitivity, and specialty. Values show row-normalized percentages. Strong diagonal dominance indicates high agreement.
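
The row-normalized percentages described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes two coders have each assigned one category label to the same set of queries; the caption does not state which pair of coders is compared, and the function name `row_normalized_confusion` and the toy labels below are hypothetical, not taken from the paper. Each row is divided by its own total, so every cell reads as "of the queries coder A placed in this category, what share did coder B place in each category."

```python
from collections import Counter


def row_normalized_confusion(labels_a, labels_b):
    """Return (categories, matrix), where matrix[i][j] is the percentage of
    items coder A put in categories[i] that coder B put in categories[j]."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both coders must label the same items")
    categories = sorted(set(labels_a) | set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    matrix = []
    for a in categories:
        row_total = sum(counts[(a, b)] for b in categories)
        # A category coder A never used yields an all-zero row.
        matrix.append([
            100.0 * counts[(a, b)] / row_total if row_total else 0.0
            for b in categories
        ])
    return categories, matrix


# Hypothetical toy labels for illustration only, not data from the paper.
coder_a = ["learn", "learn", "triage", "learn", "triage", "wellness"]
coder_b = ["learn", "learn", "triage", "triage", "triage", "wellness"]

categories, matrix = row_normalized_confusion(coder_a, coder_b)
for name, row in zip(categories, matrix):
    print(name, [round(v, 1) for v in row])
```

In a matrix built this way, large values on the diagonal mean the two coders mostly chose the same category, which is what the caption calls diagonal dominance.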