Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training

Jiuming Qin, Che Liu, Sibo Cheng, Yike Guo, Rossella Arcucci

TL;DR

This work addresses two drawbacks of end-to-end medical vision-language pre-training: high computational cost and the loss of knowledge already stored in pre-trained encoders. It freezes both the image and text backbones and introduces a lightweight Adaptor that uses cross-attention to fuse the two modalities. The Adaptor is trained with a symmetric contrastive objective that aligns image and text embeddings in a shared space while the backbones stay frozen. On classification and segmentation tasks across three datasets, the framework performs competitively with over 90% fewer trainable parameters, and it is particularly strong in the low-data regime, where only 1% of the data is used for fine-tuning. The approach achieves robust cross-modal fusion without fine-tuning large backbones, offering a practical, parameter-efficient path for medical VL-SSL.
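The symmetric contrastive objective mentioned above can be illustrated with a small sketch. This is standard-library Python for a CLIP-style symmetric InfoNCE loss, not the authors' implementation: the function names, the L2 normalisation, and the temperature of 0.07 are assumptions, since this summary does not state them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy terms.

    Row i of image_emb and row i of text_emb are treated as a matched pair;
    every other row in the batch serves as a negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    if not image_emb or len(image_emb) != len(text_emb):
        raise ValueError("expected two non-empty batches of equal size")

    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: the correct text for image i sits on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation on the transposed logits.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batches: correctly paired embeddings should score a lower loss than swapped ones.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a frozen-backbone setup such as the one described here, a loss of this kind would update only the Adaptor's parameters; the backbone outputs enter it as fixed inputs.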

Abstract

Modern healthcare often utilises radiographic images alongside textual reports for diagnostics, encouraging the use of Vision-Language Self-Supervised Learning (VL-SSL) with large pre-trained models to learn versatile medical vision representations. However, most existing VL-SSL frameworks are trained end-to-end, which is computation-heavy and can lose vital prior information embedded in pre-trained encoders. To address both issues, we introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen, and employs a lightweight Adaptor module for cross-modal learning. Experiments on medical image classification and segmentation tasks across three datasets reveal that our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches. Notably, when fine-tuned with just 1% of data, Adaptor outperforms several Transformer-based methods trained on full datasets in medical image segmentation.

Paper Structure

This paper contains 7 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The Adaptor framework. The duplicated cross-attention and feedforward blocks are the same module; they are drawn twice only to show the different choices of query, key, and value vectors in the attention mechanism for the two modalities. Blue blocks are frozen during both pre-training and fine-tuning, while yellow and grey blocks are updated during the pre-training and downstream evaluation stages, respectively.
  • Figure 2: Number of trainable parameters vs. performance on RSNA classification. The size of each data point also reflects the number of trainable parameters.
  • Figure 3: t-SNE visualisation of vision embeddings from the COVIDx test dataset, before and after the Adaptor module.
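
The Figure 1 caption says a single cross-attention block serves both modalities, with only the source of the query, key, and value vectors changing. The sketch below illustrates that idea in standard-library Python under stated assumptions: single-head scaled dot-product attention, a hypothetical `SharedCrossAttention` class, and queries drawn from one modality while keys and values come from the other. The caption does not specify the actual assignment, the head count, or the projection sizes, and the feedforward block is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


class SharedCrossAttention:
    """Single-head cross-attention with one set of projections for both modalities."""

    def __init__(self, w_q, w_k, w_v):
        # The same three projection matrices are reused for every call,
        # regardless of which modality supplies the queries.
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v

    def __call__(self, query_tokens, context_tokens):
        q = matmul(query_tokens, self.w_q)
        k = matmul(context_tokens, self.w_k)
        v = matmul(context_tokens, self.w_v)
        scale = math.sqrt(len(q[0]))
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        # Each query token gets a probability distribution over context tokens.
        weights = [softmax([s / scale for s in row]) for row in scores]
        return matmul(weights, v)


# Toy example with identity projections (hypothetical values, for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
attn = SharedCrossAttention(identity, identity, identity)

image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy image tokens
text_tokens = [[0.5, 0.5], [1.0, 0.0]]               # 2 toy text tokens

image_out = attn(image_tokens, text_tokens)  # image queries attend over text
text_out = attn(text_tokens, image_tokens)   # text queries attend over image
```

Because the weights are shared, swapping the two arguments is the only difference between the image branch and the text branch, which is why the figure can draw the block twice without adding parameters.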