MedVAL: Toward Expert-Level Medical Text Validation with Language Models

Asad Aali; Vasiliki Bikia; Maya Varma; Nicole Chiou; Sophie Ostmeier; Arnav Singhvi; Magdalini Paschali; Ashwin Kumar; Andrew Johnston; Karimar Amador-Martinez; Eduardo Juan Perez Guerrero; Paola Naovi Cruz Rivera; Sergios Gatidis; Christian Bluethgen; Eduardo Pontes Reis; Eddy D. Zandee van Rilland; Poonam Laxmappa Hosamani; Kevin R Keet; Minjoung Go; Evelyn Ling; David B. Larson; Curtis Langlotz; Roxana Daneshjou; Jason Hom; Sanmi Koyejo; Emily Alsentzer; Akshay S. Chaudhari

TL;DR

MedVAL is a self-supervised distillation framework for training evaluator LMs that check whether LM-generated medical text is factually consistent with its input, without physician labels or reference outputs. It generates synthetic perturbations of outputs at controlled degradation levels, filters them with a consistency metric, and fine-tunes on a small, high-quality dataset, improving risk-based validation across six medical tasks and multilingual settings. On MedVAL-Bench (840 physician-annotated outputs), MedVAL raises average four-class F1 from 0.367 to 0.510 and average safe/unsafe F1 from 0.662 to 0.828 across 10 LMs. GPT-4o reaches 0.587 post-distillation and is statistically non-inferior to a single human expert on a subset annotated by multiple physicians. The method generalizes to unseen tasks and distribution shifts, and the authors release open-source code, datasets, and the open MedVAL-4B model to support scalable, privacy-preserving clinical validation of LM-generated medical text.

Abstract

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p < 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert on a subset annotated by multiple physicians (p < 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.
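
The TL;DR reports two F1 figures, a four-class score and a safe/unsafe score. The sketch below shows one way such a pair could be computed from physician and evaluator risk labels. It is illustrative only: the integer risk levels 1 to 4, the cutoff `SAFE_MAX_LEVEL = 2`, the use of macro averaging, and the toy labels are all assumptions, not details taken from the paper or its codebase.

```python
# Illustrative sketch: four-class and safe/unsafe F1 from risk labels.
# All constants and labels below are hypothetical, not from MedVAL.

RISK_LEVELS = (1, 2, 3, 4)   # assumed encoding of the four risk classes
SAFE_MAX_LEVEL = 2           # assumed cutoff: levels <= 2 count as "safe"


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def to_binary(levels):
    """Collapse four risk levels into safe/unsafe using the assumed cutoff."""
    return ["safe" if level <= SAFE_MAX_LEVEL else "unsafe" for level in levels]


# Toy labels for eight outputs (made up for illustration).
physician = [1, 2, 3, 4, 1, 3, 2, 4]
evaluator = [1, 3, 3, 4, 2, 3, 2, 1]

four_class_f1 = macro_f1(physician, evaluator, RISK_LEVELS)
binary_f1 = macro_f1(to_binary(physician), to_binary(evaluator), ("safe", "unsafe"))

print(f"four-class F1:  {four_class_f1:.3f}")
print(f"safe/unsafe F1: {binary_f1:.3f}")
```

Collapsing to two classes forgives confusions between neighbouring levels on the same side of the cutoff, which is one reason a safe/unsafe score can sit well above the four-class score for the same evaluator.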
