UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation
Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
TL;DR
This work tackles radiology report generation under limited labeled data by transferring knowledge from CLIP to the medical domain using UniCrossAdapter, a parameter-efficient multimodal adapter. The adapters are integrated into CLIP's text and image encoders and coupled via cross-attention, with a Transformer decoder generating reports from fused multimodal representations. Experiments on IU-Xray and MIMIC-CXR show state-of-the-art or competitive performance across standard metrics, with ablations validating the necessity of both the adapters and CLIP pre-training. The approach demonstrates the practical potential of leveraging large pre-trained multimodal models for data-scarce medical vision-language tasks while keeping computational demands manageable.
Abstract
Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and text. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, a set of lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while the base parameters are kept fixed. The adapters are placed in both the image and text encoders and are coupled across the two modalities to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing the state of the art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
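To make the adapter idea concrete, the following is a minimal NumPy sketch of one way a pair of bottleneck adapters could be coupled by cross-attention and added residually to the outputs of frozen encoder layers. It is not the authors' implementation: the class name `CrossAdapterSketch`, the bottleneck design, the zero-initialized up-projections, and the token dimensions are all assumptions chosen for illustration. Only the forward pass is shown; in training, the adapter weights would be the only parameters updated.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossAdapterSketch:
    """Hypothetical adapter pair (image + text) coupled by cross-attention.

    The frozen encoder features are passed in as arguments; this class holds
    only the small set of adapter weights that would be trained.
    """

    def __init__(self, d_img, d_txt, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projections map each modality into a shared bottleneck space.
        self.down_img = rng.normal(0.0, 0.02, (d_img, d_bottleneck))
        self.down_txt = rng.normal(0.0, 0.02, (d_txt, d_bottleneck))
        # Assumption: zero-initialized up-projections, so the adapter starts
        # as an identity mapping and leaves the pre-trained features intact.
        self.up_img = np.zeros((d_bottleneck, d_img))
        self.up_txt = np.zeros((d_bottleneck, d_txt))

    def num_params(self):
        weights = (self.down_img, self.down_txt, self.up_img, self.up_txt)
        return sum(w.size for w in weights)

    @staticmethod
    def _attend(queries, keys_values):
        # Single-head scaled dot-product attention in the bottleneck space.
        scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
        return softmax(scores, axis=-1) @ keys_values

    def __call__(self, img_tokens, txt_tokens):
        # Project both modalities down and apply a ReLU nonlinearity.
        z_img = np.maximum(img_tokens @ self.down_img, 0.0)
        z_txt = np.maximum(txt_tokens @ self.down_txt, 0.0)
        # Cross-modal coupling: each modality attends to the other.
        z_img_x = z_img + self._attend(z_img, z_txt)
        z_txt_x = z_txt + self._attend(z_txt, z_img)
        # Project back up and add residually to the frozen features.
        out_img = img_tokens + z_img_x @ self.up_img
        out_txt = txt_tokens + z_txt_x @ self.up_txt
        return out_img, out_txt


# Illustrative shapes only: 50 image tokens of width 768, 16 text tokens of width 512.
rng = np.random.default_rng(1)
img_tokens = rng.normal(size=(50, 768))
txt_tokens = rng.normal(size=(16, 512))
adapter = CrossAdapterSketch(d_img=768, d_txt=512, d_bottleneck=64)
out_img, out_txt = adapter(img_tokens, txt_tokens)
```

The outputs keep the shapes of the inputs, so such a module could be slotted between existing encoder layers without changing the rest of the network. Because only the four projection matrices are trainable in this sketch, the number of updated parameters stays small relative to the frozen backbone, which is the property the abstract refers to as parameter-efficient adaptation.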
