UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation

Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou

TL;DR

This work tackles radiology report generation under limited labeled data by transferring knowledge from CLIP to the medical domain using UniCrossAdapter, a parameter-efficient multimodal adapter. The adapters are integrated into CLIP's text and image encoders and coupled via cross-attention, with a Transformer decoder generating reports from fused multimodal representations. Experiments on IU-Xray and MIMIC-CXR show state-of-the-art or competitive performance across standard metrics, with ablations validating the necessity of both the adapters and CLIP pre-training. The approach demonstrates the practical potential of leveraging large pre-trained multimodal models for data-scarce medical vision-language tasks while keeping computational demands manageable.

Abstract

Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while the base parameters are kept fixed. The adapters are placed in both the image and text encoders, as well as in the interaction between the two modalities, to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing the state of the art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
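To make the adapter idea concrete, the following is a minimal sketch of one cross-modal adapter step, written in plain Python so it runs without any dependencies. It is not the authors' implementation: the bottleneck (down-project, fuse, up-project) design, the shared bottleneck width, the zero-initialized up-projections, and every name in it (`CrossModalAdapter`, `cross_attend`, `rand_matrix`, and so on) are assumptions made for illustration. The frozen CLIP layers are represented only by the token matrices passed in; the adapter weights are the only parameters that would be trained.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]  # row-major: one row per token


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def relu(a: Matrix) -> Matrix:
    return [[max(0.0, x) for x in row] for row in a]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.02) -> Matrix:
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: each query row attends over context rows."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    context_t = [list(col) for col in zip(*context)]
    scores = [[s * scale for s in row] for row in matmul(queries, context_t)]
    return matmul(softmax_rows(scores), context)


class CrossModalAdapter:
    """Hypothetical adapter pair: one bottleneck per modality, coupled by cross-attention."""

    def __init__(self, dim_img: int, dim_txt: int, bottleneck: int, seed: int = 0):
        rng = random.Random(seed)
        self.down_img = rand_matrix(dim_img, bottleneck, rng)
        self.down_txt = rand_matrix(dim_txt, bottleneck, rng)
        # Assumption: zero-initialized up-projections, so the adapter starts
        # as an identity mapping around the frozen encoder features.
        self.up_img = zeros(bottleneck, dim_img)
        self.up_txt = zeros(bottleneck, dim_txt)

    def __call__(self, img_tokens: Matrix, txt_tokens: Matrix) -> Tuple[Matrix, Matrix]:
        # Project both modalities into a shared low-dimensional space.
        z_img = relu(matmul(img_tokens, self.down_img))
        z_txt = relu(matmul(txt_tokens, self.down_txt))
        # Couple the modalities: each side attends over the other.
        z_img_fused = add(z_img, cross_attend(z_img, z_txt))
        z_txt_fused = add(z_txt, cross_attend(z_txt, z_img))
        # Project back and add residually to the frozen features.
        img_out = add(img_tokens, matmul(z_img_fused, self.up_img))
        txt_out = add(txt_tokens, matmul(z_txt_fused, self.up_txt))
        return img_out, txt_out


if __name__ == "__main__":
    rng = random.Random(1)
    image_tokens = rand_matrix(5, 8, rng, scale=1.0)  # 5 image tokens, width 8
    text_tokens = rand_matrix(3, 6, rng, scale=1.0)   # 3 text tokens, width 6
    adapter = CrossModalAdapter(dim_img=8, dim_txt=6, bottleneck=4)
    img_out, txt_out = adapter(image_tokens, text_tokens)
    print(len(img_out), len(img_out[0]), len(txt_out), len(txt_out[0]))
```

In this sketch the trainable parameter count is `2 * bottleneck * (dim_img + dim_txt)` per adapter, which is small relative to a full encoder layer when the bottleneck is narrow. That is the sense in which such modules are parameter-efficient; the actual sizes and placement used in the paper are not stated in the text above.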

Paper Structure

This paper contains 18 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: (Left) Overall architecture of our method for radiology report generation, leveraging CLIP and the proposed UniCrossAdapter. (Right) Illustration of the interaction between the UniCrossAdapter and CLIP's text and image encoders.
  • Figure 2: Example of radiology report generation results on a test image from our model and its ablated variants. Ground truth words present in the generated reports are highlighted in color.