Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation

Junhyeok Lee, Yujin Oh, Dahyoun Lee, Hyon Keun Joh, Chul-Ho Sohn, Sung Hyun Baik, Cheol Kyu Jung, Jung Hyun Park, Kyu Sung Choi, Byung-Hoon Kim, Jong Chul Ye

TL;DR

PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports.

Abstract

Acute ischemic stroke (AIS) requires time-critical management, and a delay of hours in intervention can leave the patient with irreversible disability. Since diffusion-weighted imaging (DWI), a magnetic resonance imaging (MRI) technique, plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretable AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. The retrieved radiology reports are then used to augment the report generation process for the query image. Experiments with extensive in-house and public datasets show that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.

Paper Structure

This paper contains 19 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Graphical illustration of the proposed method for image-to-text retrieval. a) Conventional methods require training an image encoder and/or a text encoder to align similar image-text pairs in the representation space; a significant difficulty arises from learning the cross-domain joint distribution. b) The proposed method requires training only an image encoder to minimize the distance between similar images in the representation space. The relevant text is obtained from the retrieved similar image, whose paired ground-truth text serves as the retrieved text.
  • Figure 2: Schematic illustration of the proposed method. a) Overview of training the 3D MRI image encoder. In the first stage, the 3D MRI input images are masked and encoded with a 3D ViT encoder, then decoded to reconstruct the original input 3D MRI image. In the second stage, the pretrained image encoder is fine-tuned to classify the four ischemic territories. b) Overview of the proposed cross-modal RAG framework PIRTA. Parameters of the image encoder are frozen, and the most relevant images and their paired ground-truth radiology reports are retrieved based on the cosine similarity in the image representation space. Retrieved radiology reports are used to augment and improve factuality of the final generated radiology report.
  • Figure 3: Quantitative performance of the proposed method compared to baseline. SNUH+SNUBH are internal datasets, for which separate training-set data from the same institutions were used for SFT, while BRMH and ISLES are external datasets.
  • Figure 4: Retrieval results on three classes. Five scans are retrieved for each query. The first column displays the query image, while the following columns present the retrieved scans for each query, arranged in rank order. A red contour in the MRI scans marks a stroke infarction lesion. A false retrieval is indicated by a red bounding box, and a true retrieval by a green bounding box. "No" indicates no pretraining, "Small" indicates pretraining with the small dataset, and "Large" indicates pretraining with the large dataset.
  • Figure 5: The 768-dimensional feature vectors derived from the image encoder are projected onto a 2D manifold using UMAP. Each stroke lesion type is color-coded. "No" indicates no pretraining, "Small" indicates pretraining with the small dataset, and "Large" indicates pretraining with the large dataset.
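
The retrieval-then-augmentation step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the embeddings have already been produced by a frozen image encoder, and the function names (`cosine_similarity`, `retrieve_reports`, `build_augmented_prompt`), the prompt wording, and the value of `k` are all hypothetical choices made for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_reports(
    query_emb: Sequence[float],
    bank_embs: Sequence[Sequence[float]],
    bank_reports: Sequence[str],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (similarity, report) pairs whose paired images are most
    similar to the query image in the image representation space."""
    if len(bank_embs) != len(bank_reports):
        raise ValueError("each stored embedding needs exactly one paired report")
    scored = [
        (cosine_similarity(query_emb, emb), report)
        for emb, report in zip(bank_embs, bank_reports)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_augmented_prompt(retrieved: Sequence[Tuple[float, str]]) -> str:
    """Assemble a text prompt that passes the retrieved reports to a language
    model as context (hypothetical wording, for illustration only)."""
    lines = ["Reference reports from similar DWI scans:"]
    for rank, (score, report) in enumerate(retrieved, start=1):
        lines.append(f"[{rank}] (similarity {score:.2f}) {report}")
    lines.append("Write the report for the query scan using the references above.")
    return "\n".join(lines)


# Toy 3-dimensional embeddings with placeholder report strings (not real data).
bank_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
bank_reports = ["report A", "report B", "report C"]
top = retrieve_reports([1.0, 0.05, 0.0], bank_embs, bank_reports, k=2)
print(build_augmented_prompt(top))
```

Because only image embeddings are compared, no image-text alignment needs to be learned at retrieval time; the text enters solely through the reports already paired with the retrieved scans.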