MedVAL: Toward Expert-Level Medical Text Validation with Language Models

Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari

TL;DR

MedVAL introduces a self-supervised distillation framework for training evaluators that validate LM-generated medical text for factual consistency, without physician labels or reference outputs. The method generates synthetic perturbations of outputs at controlled degradation levels, filters them with a consistency metric, and fine-tunes on the resulting small, high-quality dataset, which strengthens risk-based validation across six medical tasks and multilingual settings. On MedVAL-Bench (840 physician-annotated outputs), MedVAL improves average four-class F1 from 0.367 to 0.510 and safe/unsafe F1 from 0.662 to 0.828 across 10 LMs, with GPT-4o achieving 0.587 post-distillation and showing non-inferiority to a single human expert on a subset. The work demonstrates robust generalization to unseen tasks and distribution shifts, and releases open-source code, datasets, and an open model (MedVAL-4B) to support scalable, privacy-preserving clinical validation of LM-generated medical text.

Abstract

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p < 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert on a subset annotated by multiple physicians (p < 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.
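The significance claims above compare the same evaluator before and after distillation on the same physician-annotated outputs, and the Figure 3 caption names the McNemar test for this comparison. Below is a minimal sketch of an exact two-sided McNemar test in Python. It assumes each evaluator's predictions have already been reduced to per-example "matches the physician" flags; the function name `mcnemar_exact` and the toy data are hypothetical, and the paper may use a different variant of the test (for example, the chi-square approximation).

```python
from math import comb


def mcnemar_exact(baseline_correct, distilled_correct):
    """Exact two-sided McNemar test on paired correctness flags.

    Each input is a sequence of booleans, one per benchmark example,
    saying whether that evaluator agreed with the physician label.
    Returns (b, c, p_value), where b and c count the discordant pairs.
    """
    if len(baseline_correct) != len(distilled_correct):
        raise ValueError("paired inputs must have equal length")

    pairs = list(zip(baseline_correct, distilled_correct))
    # b: baseline right, distilled wrong; c: baseline wrong, distilled right.
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference

    # Under the null hypothesis, discordant pairs split Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical, not from MedVAL-Bench): 15 paired examples.
baseline = [True] + [False] * 9 + [True] * 5
distilled = [False] + [True] * 9 + [True] * 5
b, c, p = mcnemar_exact(baseline, distilled)
print(f"discordant pairs: b={b}, c={c}, p={p:.4f}")
```

Only the discordant pairs enter the test, so examples that both evaluators get right (or both get wrong) do not affect the p-value.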

Paper Structure

This paper contains 40 sections, 2 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: a) MedVAL test-time workflow. A generator LM produces an output, and MedVAL then assesses the output's factual consistency with the input, while assigning a risk grade and determining whether the output is safe for deployment or not. b) Study framework. 12 physicians assess 840 LM-generated medical text outputs. Using physician assessments as reference, we measure the accuracy of LMs in medical text validation across 10 LMs, 2 methods (baseline vs. MedVAL), and 6 tasks.
  • Figure 2: MedVAL self-supervised data curation illustrated through a radiology report summarization example. 1) A generator $g_\theta$ takes as input $x$, and generates a clean and a perturbed output using a random perturbation level $\delta \in [0, 1]$. 2) A validator $v_\phi$ then provides a detailed error assessment, predicts the factual degradation level $\hat{\delta}_{clean}$ and $\hat{\delta}_{corrupt}$ of the clean and perturbed outputs, respectively, and filters data with high generator-validator consistency for fine-tuning an arbitrary LM.
  • Figure 3: Performance benchmark (F1 score). a) We report the performance of LMs before and after MedVAL distillation. b) We rank all LMs (low to high), grouped into three methods. c) We report the $\Delta$ F1 score between MedVAL and baseline LM performance across each prediction class. Baseline indicates zero-shot LM before distillation, Baseline (larger comparator) indicates a larger zero-shot LM as reference (not chosen for distillation), and MedVAL indicates LM after distillation. $^*p<0.001$ indicates a statistically significant difference in classification performance between MedVAL and baseline (McNemar test). Notably, smaller MedVAL LMs match or exceed the performance of much larger baseline LMs. Furthermore, MedVAL Qwen3-4B (52.7%) and GPT-4o (58.7%) achieve the highest F1 score ranking under their respective categories.
  • Figure 4: Representative examples of validation of LM-generated medical text by 1) the physician, 2) baseline GPT-4o, and 3) MedVAL GPT-4o. Under each example, MedVAL demonstrates higher agreement with the physician.
  • Figure S1: Performance benchmark (Cohen's $\kappa$). We rank all LMs (low to high). Baseline indicates zero-shot LM before distillation, Baseline (larger comparator) indicates a larger zero-shot LM as reference (not chosen for distillation), and MedVAL indicates LM after distillation. Notably, smaller MedVAL LMs match or exceed the performance of much larger baselines.
  • ...and 6 more figures
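
The filtering step in the Figure 2 caption (keep only samples with high generator-validator consistency) can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: the `Sample` class, the `consistency_error` score (absolute deviation of $\hat{\delta}_{clean}$ from 0 plus absolute deviation of $\hat{\delta}_{corrupt}$ from $\delta$), and the `max_error` threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    delta: float              # perturbation level requested from the generator, in [0, 1]
    delta_hat_clean: float    # validator's predicted degradation of the clean output
    delta_hat_corrupt: float  # validator's predicted degradation of the perturbed output


def consistency_error(s: Sample) -> float:
    """Hypothetical score: 0 means the validator fully agrees with the generator.

    Assumes a clean output should be rated 0 and a perturbed output
    should be rated at the requested level delta.
    """
    return abs(s.delta_hat_clean) + abs(s.delta_hat_corrupt - s.delta)


def filter_consistent(samples, max_error=0.2):
    """Keep samples whose generator-validator disagreement is small."""
    return [s for s in samples if consistency_error(s) <= max_error]


# Toy samples (hypothetical values).
samples = [
    Sample(delta=0.5, delta_hat_clean=0.0, delta_hat_corrupt=0.5),   # consistent
    Sample(delta=0.75, delta_hat_clean=0.5, delta_hat_corrupt=0.0),  # inconsistent
    Sample(delta=0.25, delta_hat_clean=0.0, delta_hat_corrupt=0.25), # consistent
]
kept = filter_consistent(samples)
print(f"kept {len(kept)} of {len(samples)} samples for fine-tuning")
```

A stricter `max_error` trades dataset size for label quality, which fits the caption's emphasis on keeping only high-consistency data for fine-tuning.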