Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction

Kyoyun Choi, Byungmu Yoon, Soobum Kim, Jonggwon Park

TL;DR

This work addresses the high cost and data requirements of multimodal LLMs for chest X-ray radiology report generation by introducing RA-RRG, a retrieval-augmented framework that uses LLM-based key phrase extraction and a multimodal retriever to assemble concise, clinically meaningful content. Retrieval is guided by semantic embeddings via TranSQ and reinforced through in-batch contrastive learning, while key phrases fed to a generative LLM produce coherent reports with reduced hallucinations. On MIMIC-CXR, RA-RRG achieves state-of-the-art CheXbert F1 metrics and competitive RadGraph F1 without requiring LLM fine-tuning, and demonstrates robust generalization to multi-view settings, including the ability to extend to prior-study-assisted reporting. The approach offers practical benefits by lowering computational demands and enabling flexible RRG across single- and multi-view studies, with potential for broader clinical deployment and human-evaluated improvements in the future.

Abstract

Automated radiology report generation (RRG) holds potential to reduce radiologists' workload, especially as recent advancements in large language models (LLMs) enable the development of multimodal models for chest X-ray (CXR) report generation. However, multimodal LLMs (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost for training. To address these challenges, we propose a retrieval-augmented generation approach that leverages multimodal retrieval and LLMs to generate radiology reports while mitigating hallucinations and reducing computational demands. Our method uses LLMs to extract key phrases from radiology reports, effectively focusing on essential diagnostic information. By exploring effective training strategies, including image encoder structure search, adding noise to text embeddings, and additional training objectives, we combine complementary pre-trained image encoders and adopt contrastive learning between text and semantic image embeddings. We evaluate our approach on the MIMIC-CXR dataset, achieving state-of-the-art results on CheXbert metrics and a RadGraph F1 score competitive with MLLMs, without requiring LLM fine-tuning. Our method demonstrates robust generalization to multi-view RRG, making it suitable for comprehensive clinical applications.
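The abstract mentions contrastive learning between text and semantic image embeddings, together with noise added to text embeddings. As a rough illustration only, the sketch below computes a symmetric in-batch contrastive (InfoNCE-style) loss in plain Python. The function names, the Gaussian form of the noise, the temperature value, and the symmetric averaging are all assumptions made for this example; the text above does not specify the exact loss used by RA-RRG.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def add_gaussian_noise(vec, std, rng):
    """Perturb each coordinate with Gaussian noise (assumed noise model)."""
    return [x + rng.gauss(0.0, std) for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def in_batch_contrastive_loss(image_embs, text_embs,
                              temperature=0.07, noise_std=0.0, seed=0):
    """Symmetric in-batch contrastive loss.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    pairing inside the batch serves as a negative.
    """
    rng = random.Random(seed)
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(add_gaussian_noise(v, noise_std, rng))
            for v in text_embs]
    n = len(imgs)

    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled = [aligned[1], aligned[2], aligned[0]]
    print("aligned pairs:   ", in_batch_contrastive_loss(aligned, aligned))
    print("mismatched pairs:", in_batch_contrastive_loss(aligned, shuffled))
```

In this toy batch, correctly paired embeddings give a loss near zero, while mismatched pairs give a much larger loss. A nonzero `noise_std` perturbs the text side before normalization.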

Paper Structure

This paper contains 43 sections, 5 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: (a) A simplified illustration of our method. (b) Single-view RRG performance comparison between MLLMs and our model. The x-axis shows the number of parameters of fine-tuned LLMs, with 0B indicating no fine-tuning.
  • Figure 2: (a) Key phrase extraction using an LLM. (b) The multimodal retriever architecture. (c) Inference process of RA-RRG.
  • Figure 3: Example of single-view RRG. The baseline is model E1 from Table \ref{tab:Ablation}. Positive findings are highlighted in yellow, and hallucinations are marked in red.
  • Figure 4: Example of multi-view RRG. At the top are the frontal and lateral images with their predicted key phrases. Below the original report, two radiology reports are generated: 1) using only the frontal view, and 2) using both the frontal and lateral views (multi-view). Content present in the original report but visible only in the lateral view is highlighted in yellow.
  • Figure 5: Impact of threshold on example-based average CheXbert scores and the number of key phrases.
  • ...and 8 more figures
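
The captions for Figure 2(c) and Figure 5 refer to an inference step that predicts key phrases and to a threshold that affects how many are kept. The sketch below is a hypothetical illustration of that kind of thresholded retrieval, not the paper's implementation. It assumes each query carries a selection score and an embedding, and that the retrieved key phrase is the nearest entry in a phrase bank by cosine similarity. The names `retrieve_key_phrases` and `phrase_bank`, the toy data, and the default threshold are all invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_key_phrases(queries, phrase_bank, threshold=0.5):
    """Keep queries whose score meets the threshold, then look up phrases.

    queries:     list of (selection_score, embedding) pairs (assumed format)
    phrase_bank: list of (key_phrase, embedding) pairs (assumed format)
    Returns the retrieved key phrases in query order, without duplicates.
    """
    selected = []
    for score, embedding in queries:
        if score < threshold:
            continue  # this query is treated as "not present in the image"
        phrase, _ = max(phrase_bank, key=lambda item: cosine(embedding, item[1]))
        if phrase not in selected:
            selected.append(phrase)
    return selected


if __name__ == "__main__":
    bank = [("no pleural effusion", [1.0, 0.0]),
            ("cardiomegaly", [0.0, 1.0])]
    queries = [(0.9, [0.9, 0.1]), (0.2, [0.1, 0.9])]
    print(retrieve_key_phrases(queries, bank, threshold=0.5))
    print(retrieve_key_phrases(queries, bank, threshold=0.1))
```

Lowering the threshold lets more queries through, so more key phrases reach the report-writing LLM. That is the kind of trade-off a threshold sweep like the one in Figure 5 would examine.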