ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics

Oishi Banerjee; Agustina Saenz; Kay Wu; Warren Clements; Adil Zia; Dominic Buensalido; Helen Kavnoudias; Alain S. Abi-Ghanem; Nour El Ghawi; Cibele Luna; Patricia Castillo; Khaled Al-Surimi; Rayyan A. Daghistani; Yuh-Min Chen; Heng-sheng Chao; Lars Heiliger; Moon Kim; Johannes Haubold; Frederic Jonske; Pranav Rajpurkar

TL;DR

ReXamine-Global addresses the problem of evaluating radiology report generation metrics across diverse hospitals and styles. It introduces a pragmatic, LLM-powered framework that standardizes ground-truth reports, generates error-containing candidates with GPT-4, and subjects seven automatic metrics to cross-site testing against expert judgments. The study finds that most metrics exhibit undesired stylistic sensitivity and weak agreement with experts across sites, though the FineRadScore-GPT-4 approach shows comparatively stronger alignment. The results underscore the need for robust, cross-site evaluation procedures and offer guidance for selecting or designing metrics that generalize to target clinical settings.

Abstract

Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, an LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge for some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics towards evaluation procedures that work reliably at their sites of interest.
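The first check described above, style sensitivity, can be illustrated with a minimal sketch. It assumes each candidate report has been scored twice by the same metric: once against the original ground-truth report and once against its style-standardized version. The function name `style_sensitivity_gap` and the sample scores are hypothetical, and the mean paired difference used here is an illustrative summary, not necessarily the statistic used in the paper.

```python
from statistics import mean


def style_sensitivity_gap(scores_original, scores_standardized):
    """Mean paired difference between a metric's scores on the same candidates
    when scored against standardized vs. original ground-truth reports.

    A gap far from zero suggests the metric reacts to reporting style,
    since the clinical content of each ground truth is meant to be unchanged.
    """
    if len(scores_original) != len(scores_standardized):
        raise ValueError("score lists must be paired, one entry per candidate")
    if not scores_original:
        raise ValueError("need at least one candidate")
    return mean(s - o for o, s in zip(scores_original, scores_standardized))


# Hypothetical metric scores for four candidates (illustrative values only).
against_original = [0.25, 0.5, 0.25, 0.5]
against_standardized = [0.5, 0.75, 0.5, 0.75]

gap = style_sensitivity_gap(against_original, against_standardized)
print(f"mean score shift from style standardization: {gap}")
```

In practice one would compute this gap separately per site and per metric, and pair it with a significance test suited to the data; that choice is left open here.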

Paper Structure (16 sections, 3 figures, 4 tables)

Introduction
Methods
Generation of Candidate Radiology Reports Using GPT-4
Automatic Metrics
Expert Evaluation
Experiments
Results
Effect of Stylistic Differences on Metric Scores
Correlation of Automatic Metrics with Expert Scores on Stylistically Diverse Reports
Correlation of Automatic Metrics with Expert Scores on Stylistically Standardized Reports
Discussion
Limitations
Institutional Review Board (IRB)
Data Contribution
Acknowledgments
...and 1 more section

Figures (3)

Figure 1: ReXamine-Global tests how metrics generalize when used across distributions, with the goal of uncovering two failure modes. First, we test whether automatic metrics are undesirably sensitive to clinically irrelevant differences in report style, providing different scores depending on whether candidates are stylistically similar to the ground truths. Next, we test whether metrics disagree with expert scores, providing unreliable judgments at some sites. A successful metric would avoid both failure modes.
Figure 2: Using GPT-4, we first standardized the style of the ground-truth reports and then introduced errors to create AI candidates. For details on our prompts, please see Appendix A.
Figure 3: Except for FineRadScore-GPT-4, no metric achieved positive Spearman correlations with expert scores at every site, indicating poor generalization. Correlations for original ground-truth reports are shown in the black box plots (left). Correlations for standardized ground-truth reports are shown in blue box plots (right). Metrics typically achieved higher performance with standardized ground-truth reports than original ground-truth reports. For detailed numerical results, see the table in Appendix C.
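The per-site agreement check summarized in the Figure 3 caption can be sketched as follows. This is a minimal, dependency-free illustration that assumes each record holds a site label, a metric score, and an expert score for one candidate report. The helper names, site labels, and values are hypothetical. It also assumes the metric and expert scores are oriented the same way (for example, both higher-is-worse), so that a positive correlation means agreement.

```python
from collections import defaultdict


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def per_site_spearman(records):
    """Group (site, metric_score, expert_score) records and correlate per site."""
    by_site = defaultdict(lambda: ([], []))
    for site, metric_score, expert_score in records:
        by_site[site][0].append(metric_score)
        by_site[site][1].append(expert_score)
    return {site: spearman(m, e) for site, (m, e) in by_site.items()}


# Hypothetical records: the metric tracks experts at site_a but not at site_b.
records = [
    ("site_a", 1, 10), ("site_a", 2, 20), ("site_a", 3, 30), ("site_a", 4, 40),
    ("site_b", 1, 4), ("site_b", 2, 3), ("site_b", 3, 2), ("site_b", 4, 1),
]
correlations = per_site_spearman(records)
positive_everywhere = all(rho > 0 for rho in correlations.values())
print(correlations, positive_everywhere)
```

Reporting the correlation per site, rather than pooling all reports, is what exposes a metric that agrees with experts at some hospitals but not others.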
