Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation

Yasuhide Miura, Yuhao Zhang, Emily Bao Tsai, Curtis P. Langlotz, Dan Jurafsky

TL;DR

Radiology report generation systems often achieve strong scores on conventional NLG metrics while still producing reports that are factually incomplete or inconsistent. The authors introduce two domain-specific rewards, $\mathrm{fact_{ENT}}$ and $\mathrm{fact_{ENTNLI}}$, together with a weakly supervised radiology NLI model, and integrate them into an $\mathcal{M}^2$ Trans-based generator that is trained with reinforcement learning on these rewards combined with BERTScore. They report substantial gains in clinical metrics (e.g., CheXbert F1) on MIMIC-CXR and Open-i, and human evaluators note improved factual quality; correlation analysis suggests the rewards can serve as proxies for clinical accuracy. The work underscores the value of optimizing directly for factual completeness and consistency, and the framework is likely transferable to other data-to-text tasks.

Abstract

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performance on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of clinical information extraction by +22.1 ($\Delta$ +63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.
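
To make the first reward more concrete, the following is a minimal sketch of an entity-overlap score between a generated report and its reference. It assumes the radiology entities have already been extracted by some upstream tagger, and it scores them as the harmonic mean of entity precision and recall. The function name `entity_overlap_reward` and the use of plain string sets are illustrative assumptions, not the authors' implementation, and the NLI-based consistency check of the second reward is not modeled here.

```python
from typing import Iterable


def entity_overlap_reward(
    generated_entities: Iterable[str],
    reference_entities: Iterable[str],
) -> float:
    """Harmonic mean of entity precision and recall (illustrative sketch).

    Both arguments are assumed to be entity strings already extracted from
    the generated and reference reports by a separate tagger.
    """
    gen = set(generated_entities)
    ref = set(reference_entities)
    # No entities on either side means there is nothing to reward.
    if not gen or not ref:
        return 0.0

    matched = gen & ref
    precision = len(matched) / len(gen)  # generated entities found in the reference
    recall = len(matched) / len(ref)     # reference entities covered by the generation
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical entity lists, for illustration only.
score = entity_overlap_reward(
    ["effusion", "pneumothorax"],
    ["effusion", "edema", "cardiomegaly"],
)
```

In a reinforcement learning setup, a score of this kind would be computed per sampled report and used as, or mixed into, the scalar reward; how the terms are weighted against BERTScore is a design choice not specified in this summary.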

Paper Structure

This paper contains 31 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: A (partial) example of a report generated from our system (with "…" representing abbreviated text). The system encodes images and generates text from that encoded representation. Underlined words are disease and anatomy entities. The shaded sentences are an example of a contradictory pair.
  • Figure 2: An overview of Meshed-Memory Transformer extended to multiple images.
  • Figure 3: An example of radiology reports generated by R2Gen and by the proposed model with the optimization integrating BERTScore. Repeated sentences are removed from the example to improve readability.
  • Figure 4: Examples of radiology reports generated by the proposed model with the optimization integrating BERTScore and $\mathrm{fact_{ENTNLI}}$. Repeated sentences are removed from the examples to improve readability.