DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation

Sang-Jun Park, Keun-Soo Heo, Dong-Hee Shin, Young-Han Son, Ji-Hye Oh, Tae-Eui Kam

TL;DR

DART tackles trustworthy radiology report generation by coupling disease-aware image-text alignment with a self-correcting re-alignment mechanism. It first retrieves disease-relevant text using a contrastively learned embedding space and a disease-matching constraint, then generates reports from retrieved content and disease features. A second-stage self-correction module re-aligns the generated report within the embedding space to further reduce omissions and improve clinical fidelity, trained with a dedicated correction loss. The approach achieves state-of-the-art performance on MIMIC-CXR and IU X-ray across descriptive NLG metrics and clinical efficacy evaluations, demonstrating improved trustworthiness and potential to alleviate radiologists’ workload.

Abstract

The automatic generation of radiology reports has emerged as a promising way to reduce the burden of a time-consuming task while accurately capturing critical disease-relevant findings in X-ray images. Previous approaches for radiology report generation have shown impressive performance. However, there remains significant potential to improve accuracy by ensuring that retrieved reports contain disease-relevant findings similar to those in the X-ray images and by refining the generated reports. In this study, we propose the Disease-aware image-text Alignment and self-correcting Re-alignment for Trustworthy radiology report generation (DART) framework. In the first stage, we generate initial reports based on image-to-text retrieval with disease-matching, embedding both images and texts in a shared embedding space through contrastive learning. This approach ensures the retrieval of reports whose disease-relevant findings closely align with the input X-ray images. In the second stage, we further enhance the initial reports by introducing a self-correction module that re-aligns them with the X-ray images. Our proposed framework achieves state-of-the-art results on two widely used benchmarks, surpassing previous approaches in both report generation and clinical efficacy metrics, thereby enhancing the trustworthiness of radiology reports.
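
The first-stage idea described above (a shared image-text embedding space learned contrastively, followed by retrieval with disease-matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric InfoNCE loss is a common choice for contrastive image-text alignment and is assumed here, and the function names (`symmetric_info_nce`, `retrieve`), the temperature value, and the exact-label-match ranking rule are all hypothetical stand-ins for details this summary does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def _mean_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Assumed contrastive objective: image i is paired with text i."""
    logits = [[cosine(img, txt) / temperature for txt in text_embs]
              for img in image_embs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


def retrieve(image_emb, image_diseases, corpus, k=3):
    """Return the top-k report texts for one image.

    corpus holds (text, embedding, disease_labels) triples. Hypothetical
    disease-matching rule: candidates whose label set equals the image's
    label set rank first; cosine similarity breaks ties within each group.
    """
    def rank_key(entry):
        _, text_emb, text_diseases = entry
        matches = set(text_diseases) == set(image_diseases)
        return (matches, cosine(image_emb, text_emb))

    ranked = sorted(corpus, key=rank_key, reverse=True)
    return [text for text, _, _ in ranked[:k]]


# Toy usage with made-up 2-D embeddings and labels.
corpus = [
    ("no acute findings", [1.0, 0.0], set()),
    ("cardiomegaly present", [0.9, 0.1], {"cardiomegaly"}),
    ("enlarged heart", [0.0, 1.0], {"cardiomegaly"}),
]
top2 = retrieve([1.0, 0.0], {"cardiomegaly"}, corpus, k=2)
# top2 == ["cardiomegaly present", "enlarged heart"]
```

In the toy usage, the report with the highest raw similarity ("no acute findings") is ranked last because its disease labels do not match the image, which is the behavior a disease-matching constraint is meant to produce. How DART actually combines similarity and disease labels is described in the paper's method section, not in this summary.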

Paper Structure

This paper contains 18 sections, 11 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An overview of our proposed framework, which consists of two stages: (1) report generation based on disease-aware image-text alignment and (2) self-correcting re-alignment of generated reports. In the first stage, the framework generates initial reports from text features, disease-relevant features, and retrieved text features that are closely aligned with the image features in an embedding space. In the second stage, a self-correction mechanism refines the generated reports by re-aligning them within the embedding space to further enhance consistency with the input images.
  • Figure 2: A qualitative analysis of reports for a sample from the MIMIC-CXR dataset. The top row displays an image set from two different views alongside a generated report from our proposed framework without the self-correction module ("w/o Self-Correction"). We further attempt to refine the generated report of "w/o Self-Correction" using GPT-4 [achiam2023gpt] ("Correction by GPT-4") to compare it with the generated report from our proposed framework with self-correction ("Ours"). The bottom row shows the ground-truth report and the top-3 retrieved texts from image-to-text retrieval. Key findings are highlighted in different colors for clarity.
  • Figure 3: A visualization of the generated reports and attention maps from the baseline model (BASE) and our proposed framework (Ours) on one sample from the MIMIC-CXR dataset. The attention maps, visualized using Grad-CAM [selvaraju2017grad], illustrate the regions that BASE and Ours focus on for three keywords, "heart," "lung," and "focal consolidation," with each keyword highlighted in a different color.
  • Figure 4: An additional qualitative analysis of reports for three samples from the MIMIC-CXR dataset. The top row of each sample displays an image set from two different views alongside a generated report from our proposed framework without the self-correction module ("w/o Self-Correction"). We further attempted to refine the generated report of "w/o Self-Correction" using GPT-4 [achiam2023gpt] ("Correction by GPT-4") to compare it with the generated report from our proposed framework with self-correction ("Ours"). The bottom row shows the ground-truth report and the top-3 retrieved texts from image-to-text retrieval. Key findings are highlighted in different colors for clarity.
  • Figure 5: Visualizations of the generated reports and attention maps from the baseline model (BASE) and our proposed framework (Ours) on two samples from the MIMIC-CXR dataset. The attention maps, visualized using Grad-CAM [selvaraju2017grad], illustrate the regions that BASE and Ours focus on for keywords such as "heart," "lung," "pneumothorax," and "focal consolidation," with each keyword highlighted in a different color.
  • ...and 1 more figure