Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings

Razi Mahmood, Pingkun Yan, Diego Machado Reyes, Ge Wang, Mannudeep K. Kalra, Parisa Kaviani, Joy T. Wu, Tanveer Syeda-Mahmood

TL;DR

This work introduces a novel evaluation metric for automated chest X-ray radiology reports that jointly considers fine-grained textual findings and their phrasal grounding to anatomical regions. The method extracts structured FFL patterns from both the ground-truth and the generated report, aligns them with anatomical regions on the image via a bipartite matching framework, computes a combined score $RQ(G,P) = \text{F1}(G,P) + \text{MIOU}(G,P)$ for each report pair, and averages it across samples. Evaluated on the ChestImagenome gold standard derived from MIMIC, the approach demonstrates robustness to factual errors and provides more balanced quality assessments than conventional lexical or clinical-accuracy metrics, highlighting its potential for reliable fact-checking of radiology reports. The reported results show overlap ranges of, for example, 36–48% spatial and 33–44% descriptive, and reveal greater sensitivity to location-specific errors than some existing measures, supporting the metric's practical utility in AI-assisted radiology report evaluation.

Abstract

Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information, using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns that capture the location, laterality, and severity of a large number of clinical findings. We then perform phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present results that compare this evaluation metric with other textual metrics on a gold standard dataset derived from the MIMIC collection and show its robustness and sensitivity to factual errors.
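
To make the combined score concrete, the following is a minimal Python sketch of a score of the form F1 + mean IoU for a single report pair. It is not the authors' implementation. Several things are assumptions made for illustration: that G and P denote the ground-truth and generated reports, that each report is reduced to a mapping from a finding pattern (an arbitrary string here) to one bounding box `(x1, y1, x2, y2)`, that F1 counts exact pattern matches, and that the bipartite matching step can be stood in for by a brute-force search over one-to-one box assignments. All function names are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1(gt_patterns, pred_patterns):
    """F1 over exact matches of pattern strings (assumed matching rule)."""
    gt, pred = set(gt_patterns), set(pred_patterns)
    tp = len(gt & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)


def mean_iou(gt_boxes, pred_boxes):
    """Mean IoU under the best one-to-one assignment of boxes.

    Brute force over permutations, so only suitable for a handful of boxes.
    Boxes left unmatched contribute an IoU of 0 (assumed convention).
    """
    if not gt_boxes or not pred_boxes:
        return 0.0
    small, large = sorted((gt_boxes, pred_boxes), key=len)
    best = max(
        sum(iou(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


def report_quality(gt, pred):
    """Combined score F1 + mean IoU for one ground-truth/generated pair.

    gt and pred map a pattern string to its bounding box.
    """
    return f1(gt.keys(), pred.keys()) + mean_iou(list(gt.values()), list(pred.values()))


# Hypothetical example: same finding pattern, but the predicted box covers
# only half of the ground-truth region.
gt = {"opacity|left|lower lung|mild": (0, 0, 10, 10)}
pred = {"opacity|left|lower lung|mild": (0, 0, 10, 5)}
score = report_quality(gt, pred)
```

In this example the textual part matches exactly while the spatial part does not, so the score falls between the text-only maximum of 1 and the overall maximum of 2. For more than a few boxes, a proper assignment solver would be the more practical choice than the permutation search used here.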

Paper Structure

This paper contains 5 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of the report quality problem. (a) Original image. (b) Ground truth report (Findings and Impressions only). (c) Fine-grained (FFL) patterns extracted from the report in (b). (d) Anatomical locations of the findings identified in (c), shown as bounding boxes. (e) Automated report produced by GPT-4. (f) FFL patterns extracted from the automated report in (e). The table below shows the report evaluation scores produced by the methods described in the text for the automated report in (e).
  • Figure 2: Illustration of the overall approach to computing the report quality score.