ReEvalMed: Rethinking Medical Report Evaluation by Aligning Metrics with Real-World Clinical Judgment

Ruochen Li, Jun Li, Bailiang Jian, Kun Yuan, Youxiang Zhu

TL;DR

This work tackles the problem that radiology report generation metrics often fail to reflect clinical usefulness. It introduces a clinically grounded Meta-Evaluation framework that assesses metrics on both alignment with clinical needs and core capabilities (discrimination, robustness, monotonicity) using a ground-truth plus rewritten (GT–ME) dataset derived from MIMIC-CXR and ReXVal. Through empirical benchmarking of standard NLP and medical-specific metrics, the authors reveal pervasive gaps in capturing clinical semantics and error impact, while demonstrating that no existing metric excels across all dimensions. The framework serves as a diagnostic tool to steer the design of more clinically reliable evaluation methods, with future directions including knowledge infusion, structured reasoning in evaluations, and scalable semi-automated data generation.

Abstract

Automatically generated radiology reports often receive high scores from existing evaluation metrics but fail to earn clinicians' trust. This gap reveals fundamental flaws in how current metrics assess the quality of generated reports. We rethink the design and evaluation of these metrics and propose a clinically grounded Meta-Evaluation framework. We define clinically grounded criteria spanning clinical alignment and key metric capabilities, including discrimination, robustness, and monotonicity. Using a fine-grained dataset of ground truth and rewritten report pairs annotated with error types, clinical significance labels, and explanations, we systematically evaluate existing metrics and reveal their limitations in interpreting clinical semantics, such as failing to distinguish clinically significant errors, over-penalizing harmless variations, and lacking consistency across error severity levels. Our framework offers guidance for building more clinically reliable evaluation methods.

Paper Structure

This paper contains 18 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: A rewritten report contains either a significant error or an insignificant error.
  • Figure 2: Discriminative Score and Robustness Score.
  • Figure 3: Monotonicity evaluation using five error severity groups (Group 0–4), ranging from stylistic variations to severe logical contradictions.
  • Figure 4: Metric scores vs. clinical error severity. Ideally, metric scores should decrease monotonically with increasing error severity.
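The monotonicity expectation stated in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that a metric's scores are averaged within each severity group (Group 0–4) and that the group means are then compared in order. The function names and the example scores are hypothetical, chosen only for illustration. They are not taken from the paper, and the paper's own monotonicity measure may be defined differently.

```python
from typing import Dict, List, Tuple


def group_means(scores_by_group: Dict[int, List[float]]) -> List[Tuple[int, float]]:
    """Return (group, mean score) pairs ordered by increasing severity group."""
    return [
        (group, sum(scores) / len(scores))
        for group, scores in sorted(scores_by_group.items())
    ]


def monotonicity_violations(
    scores_by_group: Dict[int, List[float]]
) -> List[Tuple[int, int]]:
    """List adjacent group pairs where the mean score rises as severity rises.

    An empty list means the group means never increase from one severity
    group to the next, which is the behavior the caption describes as ideal.
    """
    means = group_means(scores_by_group)
    return [
        (g_low, g_high)
        for (g_low, m_low), (g_high, m_high) in zip(means, means[1:])
        if m_high > m_low
    ]


# Hypothetical scores for two made-up metrics (not results from the paper).
well_behaved = {0: [0.90, 0.95], 1: [0.80, 0.85], 2: [0.60, 0.70],
                3: [0.40, 0.50], 4: [0.10, 0.20]}
inconsistent = {0: [0.90], 1: [0.70], 2: [0.80], 3: [0.50], 4: [0.60]}

print(monotonicity_violations(well_behaved))   # no violating pairs
print(monotonicity_violations(inconsistent))   # pairs where the score rose
```

Reporting the violating group pairs, rather than a single pass/fail flag, shows where along the severity scale a metric stops tracking clinical impact.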