GREEN: Generative Radiology Report Evaluation and Error Notation

Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, Jean-Benoit Delbrouck

TL;DR

GREEN tackles the challenge of evaluating radiology reports with a focus on factual correctness and clinical relevance. It introduces a generative, LLM-based evaluator that identifies and categorizes errors, defines a mathematically grounded GREEN score, and provides a detailed, interpretable GREEN summary of clinically significant errors. Validation against expert error counts and expert preferences shows GREEN achieves higher alignment with radiologist judgments than prior metrics, and it generalizes to multiple imaging modalities and out-of-distribution data. The approach is open-source, lightweight, and privacy-preserving, with demonstrated applicability across chest X-ray, other modalities, and OOD datasets, suggesting potential as a practical standard for automated radiology report evaluation.

Abstract

Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches.

Paper Structure

This paper contains 31 sections, 1 equation, 7 figures, 11 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Motivation of GREEN.
  • Figure 2: Training procedure of the GREEN model.
  • Figure 3: Number of RadGraph permutations among the candidates for 20,000 pairs (left) and BERTScore distribution across 20,000 pairs (right).
  • Figure 4: Sample GREEN summary. For each error subcategory, we provide the most representative error explanations, enabling users to pinpoint areas for improvement for their trained systems.
  • Figure 5: Radiologist confidence vs. accuracy of preference labeling. As the experts' confidence in their preferences increases, the GREEN score aligns more closely with expert preferences than the summed error counts alone. This difference was quantified using accuracy (green lines). Of note, when GPT-4 is asked directly for a preference, it aligns poorly with the expert preference. When the GREEN score formula is applied, however, accuracy is higher even at lower expert confidence levels. Detailed results can be found in the expert preferences table (label: tab:preferences).
  • ...and 2 more figures