MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications

Praveenkumar Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Prateek Munjal, Nada Saadi, Hamza A Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan

TL;DR

Clinical LLMs face a gap between static medical knowledge and real-world operational utility. The authors propose MEDIC, a modular framework with five dimensions and a hybrid evaluation strategy, including a cross-examination framework to assess factual fidelity without references and a public leaderboard for continuous benchmarking. Their findings reveal significant knowledge-execution gaps, task-dependent heterogeneity, and divergent safety performance between passive refusals and active error detection, arguing against single-model dominance. The work supports a portfolio approach to clinical AI deployment and provides a scalable, ongoing evaluation pathway to improve safety and utility in healthcare workflows.

Abstract

While Large Language Models (LLMs) achieve superhuman performance on standardized medical licensing exams, these static benchmarks have become saturated and increasingly disconnected from the functional requirements of clinical workflows. To bridge the gap between theoretical capability and verified utility, we introduce MEDIC, a comprehensive evaluation framework establishing leading indicators across various clinical dimensions. Beyond standard question-answering, we assess operational capabilities using deterministic execution protocols and a novel Cross-Examination Framework (CEF), which quantifies information fidelity and hallucination rates without reliance on reference texts. Our evaluation across a heterogeneous task suite exposes critical performance trade-offs: we identify a significant knowledge-execution gap, where proficiency in static retrieval does not predict success in operational tasks such as clinical calculation or SQL generation. Furthermore, we observe a divergence between passive safety (refusal) and active safety (error detection), revealing that models fine-tuned for high refusal rates often fail to reliably audit clinical documentation for factual accuracy. These findings demonstrate that no single architecture dominates across all dimensions, highlighting the necessity of a portfolio approach to clinical model deployment. As part of this investigation, we released a public leaderboard on Hugging Face (https://huggingface.co/spaces/m42-health/MEDIC-Benchmark).

Paper Structure (42 sections, 5 equations, 7 figures, 3 tables, 1 algorithm)


Figures (7)

  • Figure 1: Five key dimensions of the MEDIC framework, designed to bridge the gap between the expectations of all stakeholders and the practical application of language models in clinical settings. The interconnected dimensions capture the overlapping capabilities models must possess to perform practical tasks. These capabilities can be objectively measured using specific methods and metrics, allowing the application of models in real-world clinical settings to be assessed more holistically.
  • Figure 2: Rank-based heatmap of model performance across MEDIC tasks. Green indicates top-tier performance (Rank 1), while red indicates lower relative standing. Gray cells (NA) denote tasks where the model could not be evaluated due to context length limitations. The heterogeneous distribution of rankings illustrates that no single model consistently dominates across all clinical dimensions, highlighting that performance is highly task-dependent and necessitates trade-offs between reasoning capability, safety compliance, and architectural constraints. Some model names have been abbreviated for conciseness.
  • Figure 3: Information fidelity is not strictly correlated with model scale and is poorly measured by traditional metrics. (a) Performance of the top-10 models based on average CEF score. (b) Scatter plot illustrating the relationship between Conformity (non-contradiction) and Consistency (absence of hallucination). Marker size represents model parameter count; larger models tend to show lower conformity, suggesting they are more likely to introduce information that contradicts the source document. (c) Spearman correlation heatmap between CEF fidelity scores (columns) and traditional lexical metrics (rows). The negligible correlations indicate that traditional metrics may fail to capture the dimensions of factual correctness measured by CEF.
  • Figure 4: High proficiency in static knowledge and passive safety does not guarantee functional execution or active error detection. (a) Distribution of normalized scores comparing knowledge-based benchmarks (MedQA, MedMCQA) against operational tasks (MedCalc, EHRSQL). Dashed lines indicate the median performance. While knowledge tasks show saturation near the upper bound, operational tasks display significantly higher variance and lower median scores, evidencing a distinct knowledge-execution gap. (b) Comparison of passive safety (refusal) with active safety (error correction). While most models achieve near-optimal refusal rates (i.e., a score of 1) on Med-Safety (left axis), performance degrades sharply on MEDEC (middle and right axes). The steep decline in performance from passive safety to active safety highlights the inability of current architectures to reliably verify clinical factuality despite high safety compliance.
  • Figure 5: Open-ended clinical capabilities are consistently ranked across distinct judges. (a) Forest plot of Elo ratings for open-ended clinical inquiry tasks. Ratings are computed from pairwise comparisons evaluated by three independent judge models (Llama-3.1-70B-Instruct, Qwen2.5-72B-Instruct, and DeepSeek-V3.1). Error bars denote 95% confidence intervals. Model rankings are largely preserved across judges, indicating limited sensitivity to the choice of adjudicator. (b) Spearman rank correlation of model rankings between judge models. High correlation values ($\rho \geq 0.98$) indicate strong agreement across judges, supporting the robustness of the pairwise evaluation protocol.
  • ...and 2 more figures
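
The Figure 5 caption describes Elo ratings computed from pairwise comparisons and a Spearman rank correlation between judges. The following is a minimal Python sketch of that general idea, not the authors' implementation: the captions do not state the K-factor, the starting rating, or how the 95% confidence intervals are obtained, so `k=16.0`, `base=1000.0`, the sequential update rule, and the model names `model_x`, `model_y`, `model_z` are all assumptions made for illustration. Confidence intervals are omitted, and the Spearman formula used here is only valid when there are no tied ranks.

```python
def elo_ratings(comparisons, k=16.0, base=1000.0):
    """Sequential Elo update over (winner, loser) pairs.

    k and base are illustrative assumptions, not values from the paper.
    """
    ratings = {}
    for winner, loser in comparisons:
        r_w = ratings.setdefault(winner, base)
        r_l = ratings.setdefault(loser, base)
        # Expected score of the winner under the standard logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


def ranks(scores):
    """Map each name to its rank (1 = highest score); assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no-ties case only."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


# Hypothetical pairwise verdicts from two judges over three hypothetical models.
judge_a = [("model_x", "model_y"), ("model_x", "model_z"), ("model_y", "model_z")]
judge_b = [("model_x", "model_z"), ("model_y", "model_z"), ("model_x", "model_y")]

elo_a = elo_ratings(judge_a)
elo_b = elo_ratings(judge_b)
rho = spearman_rho(elo_a, elo_b)  # agreement between the two judges' rankings
```

Because sequential Elo depends on the order of comparisons, a production pipeline would more likely average over shuffled orderings or fit a Bradley-Terry model; the sketch keeps the single-pass update only for brevity.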