RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

Zhixiu Lu, Hailong Li, Nehal A. Parikh, Jonathan R. Dillman, Lili He

TL;DR

RadCLIP adapts vision-language pre-training to 2D and 3D radiologic data. It combines a frozen CLIP text encoder with a fine-tuned 2D radiologic image encoder and a trainable slice pooling adapter that aggregates 2D slice embeddings into a 3D representation. The model is optimized with a cross-modal InfoNCE objective, $ \mathcal{L}_{i,j} = - \log \frac{\exp(\text{sim}(\mathbf{V}_i, \mathbf{T}_j)/\tau)}{\sum_{k=1}^{N} \exp(\text{sim}(\mathbf{V}_i, \mathbf{T}_k)/\tau)} $, where $\mathbf{V}_i$ and $\mathbf{T}_j$ are image and text embeddings, $\tau$ is a temperature, and $N$ is the number of candidate texts. The volumetric embedding is computed as $ \mathbf{V} = \text{MHSA}(\mathbf{I} + \text{PE}(\mathbf{P})) $, where MHSA is multi-head self-attention and PE is a positional encoding. The authors curate a large radiologic image-text dataset (over 1.1M 2D pairs and 52k 3D pairs) and report that RadCLIP achieves state-of-the-art results on both unimodal classification and cross-modal matching, which suggests potential for clinical diagnostic support and retrieval. Limitations include gaps in modality coverage and a fixed text encoder; extending the covered modalities and enriching the text are left to future work.
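
The InfoNCE objective above can be illustrated with a minimal NumPy sketch. This is not the authors' training code: the function name `info_nce`, the choice of cosine similarity for $\text{sim}$, the default temperature of 0.07, and averaging over the matched pairs ($i = j$) of a batch are all assumptions made for illustration.

```python
import numpy as np

def info_nce(image_emb, text_emb, tau=0.07):
    """Image-to-text InfoNCE over a batch of N pairs.

    Assumes row i of image_emb (V_i) is matched with row i of text_emb (T_i).
    tau=0.07 is an illustrative default, not a value taken from the paper.
    """
    # L2-normalise so the dot product equals cosine similarity (assumed sim).
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = v @ t.T / tau                               # (N, N), entry [i, k] = sim(V_i, T_k) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The diagonal holds the log-probability of each image's own text.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = info_nce(emb, emb)                        # texts identical to images
loss_shuffled = info_nce(emb, np.roll(emb, 1, axis=0))   # pairs deliberately misaligned
```

In this toy setup the matched batch yields a lower loss than the misaligned one, which is the behaviour the contrastive objective rewards.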

Abstract

The integration of artificial intelligence (AI) with radiology marks a transformative era in medicine. Vision foundation models have been adopted to enhance radiologic imaging analysis. However, the distinct complexities of 2D and 3D radiologic data pose unique challenges that existing models, pre-trained on general non-medical images, fail to address adequately. To bridge this gap and capitalize on the diagnostic precision required in radiologic imaging, we introduce Radiologic Contrastive Language-Image Pre-training (RadCLIP): a cross-modal vision-language foundation model that harnesses the Vision Language Pre-training (VLP) framework to improve radiologic image analysis. Building upon Contrastive Language-Image Pre-training (CLIP), RadCLIP incorporates a slice pooling mechanism tailored for volumetric image analysis and is pre-trained using a large and diverse dataset of radiologic image-text pairs. RadCLIP was pre-trained to align radiologic images with their corresponding text annotations, creating a robust vision backbone for radiologic images. Extensive experiments demonstrate RadCLIP's superior performance in both uni-modal radiologic image classification and cross-modal image-text matching, highlighting its promise for improving diagnostic accuracy and efficiency in clinical settings. Our key contributions include curating a large dataset of diverse 2D/3D radiologic image-text pairs, a slice pooling adapter that uses an attention mechanism to integrate 2D images, and comprehensive evaluations of RadCLIP on various radiologic downstream tasks.
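
As a rough illustration of the slice pooling idea, the sketch below applies multi-head self-attention to a stack of 2D slice embeddings after adding a positional encoding, then reduces the result to one vector per volume. It is a hypothetical NumPy stand-in, not the authors' adapter: the class name `SlicePoolingSketch`, the random (untrained) weights, the number of heads, the positional-encoding initialisation, and the final mean over slices are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SlicePoolingSketch:
    """Hypothetical adapter: MHSA over (slice embeddings + positional encoding)."""

    def __init__(self, dim, num_heads=4, max_slices=64, seed=0):
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        rng = np.random.default_rng(seed)
        self.dim, self.h = dim, num_heads
        # Random projections stand in for weights that would be learned.
        self.w_q, self.w_k, self.w_v, self.w_o = (
            rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(4)
        )
        # One positional vector per slice index (randomly initialised, assumed learnable).
        self.pos = rng.normal(scale=0.02, size=(max_slices, dim))

    def __call__(self, slices):
        s, d = slices.shape                   # (num_slices, dim)
        x = slices + self.pos[:s]             # inject slice order

        def split(m):                         # (S, D) -> (H, S, D/H)
            return m.reshape(s, self.h, d // self.h).transpose(1, 0, 2)

        q, k, v = split(x @ self.w_q), split(x @ self.w_k), split(x @ self.w_v)
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // self.h))  # (H, S, S)
        out = (attn @ v).transpose(1, 0, 2).reshape(s, d) @ self.w_o     # (S, D)
        return out.mean(axis=0)               # assumed pooling: average over slices

adapter = SlicePoolingSketch(dim=32)
slice_embeddings = np.random.default_rng(1).normal(size=(10, 32))  # 10 slices of one volume
volume_embedding = adapter(slice_embeddings)                       # shape (32,)
```

Because the output has the same dimensionality as a single slice embedding, a volume embedding produced this way can be compared with text embeddings in the same manner as a 2D image embedding.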

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: RadCLIP Model Architecture. (a) The framework integrates a frozen text encoder from CLIP with a fine-tuned 2D image encoder to extract rich radiologic features. (b) The slice pooling adapter then aggregates these 2D slice embeddings into a unified 3D volumetric representation using an attention mechanism that preserves spatial context. Together, these components enable effective cross-modal alignment between radiologic images and their corresponding text descriptions.
  • Figure 2: This diagram details our adapter that converts a stack of 2D slice embeddings into a unified 3D image representation. The adapter employs a multi-head self-attention mechanism to capture inter-slice dependencies and integrates learnable random positional encoding to embed spatial order.
  • Figure 3: Overview of the RadCLIP Datasets. This figure presents our comprehensive dataset, which includes 1,157,587 2D radiologic image–text pairs and 52,766 3D image–text pairs from 14 public sources. Representative samples illustrate the diversity in imaging modalities and anatomical regions used for training and evaluation.
  • Figure 4: Downstream Tasks Using RadCLIP. Top panels (Image Unimodal Classification) demonstrate the linear probing approach for image classification, where a single-layer classifier is trained on features extracted by RadCLIP. Bottom panels (Image–Text Crossmodal Alignment) illustrate the image–text matching setup using cosine similarity to align image embeddings with their corresponding textual descriptions.
  • Figure 5: Sample images from each benchmark dataset are paired with both correct modality labels (e.g., “Chest X-ray Image,” “Brain MRI Image,” “Chest CT Image”) and distractor labels (e.g., “A Puppy,” “A Cat,” “A Life Vest”). The accompanying bar charts show each model’s matching score for these text prompts. Higher scores indicate stronger alignment between the image and text.
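
The image-text matching setup described for Figures 4 and 5 can be sketched as follows. The embeddings here are random stand-ins rather than RadCLIP outputs, the function name `matching_scores` is hypothetical, and turning cosine similarities into scores with a temperature-scaled softmax is an assumption about how the matching scores are normalised.

```python
import numpy as np

def matching_scores(image_emb, prompt_embs, tau=0.07):
    """Cosine similarity between one image and several text prompts,
    plus softmax-normalised scores (tau is an illustrative value)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = txt @ img                          # one cosine similarity per prompt
    logits = sims / tau
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return sims, probs

# Prompts taken from the Figure 5 caption: one correct label, two distractors.
prompts = ["Chest X-ray Image", "A Puppy", "A Life Vest"]

rng = np.random.default_rng(0)
prompt_embs = rng.normal(size=(3, 64))                   # stand-in text embeddings
image_emb = prompt_embs[0] + 0.1 * rng.normal(size=64)   # stand-in image close to the first prompt

sims, probs = matching_scores(image_emb, prompt_embs)
best = prompts[int(np.argmax(probs))]
```

With real encoders, `prompt_embs` would come from the text encoder and `image_emb` from the image encoder (or the slice pooling adapter for 3D volumes); the scoring step itself would stay the same.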