
Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

Abstract

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries -- achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated and items vary meaningfully in difficulty and discrimination.

Paper Structure

This paper contains 28 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An Overview of the MedIRT Framework, illustrating the three integrated components: (1) a USMLE-aligned benchmark with EFA-based content validation, (2) a large-scale LLM cohort, and (3) topic-level 2PL IRT modeling.
  • Figure 2: Internal Prediction Generalization Validation. Predictive accuracy on held-out LLM-item pairs under an 80/20 train-test split. †Accuracy-based evaluation is descriptive and does not define a generative model for predicting model-item interactions.
  • Figure 3: Heatmap of topic-wise IRT Ability ($\theta$) derived from the pruned item set for the top 15 Models. Rows list models with the mean ability across topics; columns are topic abbreviations, according to Table \ref{tab:dataset_distribution}.
  • Figure 4: Difficulty-Tier Hit-Rate Profiles: Contrasting DSR and DIR Across Ability Strata. Each bar shows the percentage of questions answered correctly within that item difficulty tier, with figure labels (A, B) denoting the bottom-15 and top-15 LLM cohorts respectively. Within each subfigure, the left panel shows a clean monotonic decline. The right panel shows the opposite: hit rates rise from one tier to the next before falling again. Note that the two subfigures use different y-axis scales reflecting the distinct ability ranges of each cohort.
  • Figure 5: IRT ability ($\theta$) vs. raw accuracy per model--topic pair. Wide vertical spread at any given accuracy level indicates that IRT captures information beyond proportion correct. Red points are formally misfitting pairs (Zh $< -1.96$).
  • ...and 5 more figures
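
To make the 2PL modeling named in Figure 1 and the accuracy-versus-ability spread described in Figure 5 concrete, the following is a minimal sketch, not the paper's implementation. It uses the standard two-parameter logistic form, in which the probability of a correct response depends on ability $\theta$, item discrimination $a$, and item difficulty $b$. The item parameters, the two response patterns, and the grid-search estimator are all illustrative assumptions; the paper's actual estimation procedure is not described in this section.

```python
import math

def p_correct(theta, a, b):
    """Standard 2PL: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern under the 2PL model."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_theta(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum-likelihood estimate of theta with item parameters
    held fixed (an assumption made for illustration only)."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (discrimination, difficulty) pairs; not taken from the paper.
items = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical models with identical raw accuracy (2 of 4 correct).
pattern_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
pattern_b = [0, 0, 1, 1]  # correct only on the highly discriminating items

acc_a = sum(pattern_a) / len(items)
acc_b = sum(pattern_b) / len(items)
theta_a = estimate_theta(items, pattern_a)
theta_b = estimate_theta(items, pattern_b)

print(f"accuracy A={acc_a:.2f}, B={acc_b:.2f}")
print(f"theta    A={theta_a:.2f}, B={theta_b:.2f}")
```

Both hypothetical models score the same raw accuracy, yet the estimated ability for the second is higher because its correct answers fall on the more discriminating items. This is one way the vertical spread at a fixed accuracy level can arise.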