Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction
Kyoyun Choi, Byungmu Yoon, Soobum Kim, Jonggwon Park
TL;DR
This work addresses the high cost and data requirements of multimodal LLMs for chest X-ray radiology report generation by introducing RA-RRG, a retrieval-augmented framework that uses LLM-based key phrase extraction and a multimodal retriever to assemble concise, clinically meaningful content. Retrieval is guided by semantic embeddings via TranSQ and reinforced through in-batch contrastive learning; the key phrases are then fed to a generative LLM, which produces coherent reports with fewer hallucinations. On MIMIC-CXR, RA-RRG achieves state-of-the-art CheXbert F1 scores and competitive RadGraph F1 without requiring LLM fine-tuning, and it generalizes robustly to multi-view settings, including an extension to prior-study-assisted reporting. The approach lowers computational demands and supports flexible RRG across single- and multi-view studies, with potential for broader clinical deployment and for human evaluation in future work.
Abstract
Automated radiology report generation (RRG) holds potential to reduce radiologists' workload, especially as recent advancements in large language models (LLMs) enable the development of multimodal models for chest X-ray (CXR) report generation. However, multimodal LLMs (MLLMs) are resource-intensive, requiring vast datasets and substantial computational cost for training. To address these challenges, we propose a retrieval-augmented generation approach that leverages multimodal retrieval and LLMs to generate radiology reports while mitigating hallucinations and reducing computational demands. Our method uses LLMs to extract key phrases from radiology reports, effectively focusing on essential diagnostic information. By exploring effective training strategies, including image encoder structure search, adding noise to text embeddings, and additional training objectives, we combine complementary pre-trained image encoders and adopt contrastive learning between text and semantic image embeddings. We evaluate our approach on the MIMIC-CXR dataset, achieving state-of-the-art results on CheXbert metrics and a RadGraph F1 score competitive with MLLMs, without requiring LLM fine-tuning. Our method demonstrates robust generalization to multi-view RRG, making it suitable for comprehensive clinical applications.
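To make the contrastive objective mentioned above more concrete, the following is a minimal sketch of a generic in-batch contrastive (symmetric InfoNCE) loss between paired image and text embeddings. It is not the authors' implementation: the function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for illustration, and the sketch does not model the semantic query structure or the noise added to text embeddings.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(row, idx):
    # Numerically stable log-softmax of `row`, evaluated at position `idx`.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - lse

def in_batch_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; image_embs[k] and text_embs[k] form a positive pair.

    All other pairings within the batch act as negatives.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix scaled by temperature: rows are images, columns are texts.
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]
    # Image-to-text direction: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax_at(sims[k], k) for k in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[sims[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax_at(cols[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With this sketch, a batch whose image and text embeddings are correctly paired yields a loss near zero, while the same embeddings paired incorrectly yield a much larger loss, which is the signal that pulls matching image and key-phrase embeddings together during training.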
