Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models

Thuraya Alzubaidi, Farhad R. Nezami, Muzammil Behzad

TL;DR

This work tackles the scarcity of labeled data and computational limits in applying vision-language models to volumetric CT imaging. It proposes MedCT-VLM, a parameter-efficient framework that injects CrossModal-LoRA adapters into both the CT-CLIP vision encoder and the radiology text encoder, freezing the base model and training only 1.67M parameters. On 18 thoracic pathologies, the approach boosts zero-shot AUROC from 61.3% to 68.9% and improves accuracy and macro-F1, while reducing checkpoint size by about 74×. The results demonstrate that targeted, low-rank adapters can effectively transfer large-scale pretraining to 3D medical imaging tasks, enabling deployable, multi-task clinical VLMs with limited labeled data.
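The parameter budget quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 440M and 1.67M figures come from the abstract, while the 768-wide projection and rank 8 are hypothetical values chosen to show how a low-rank adapter shrinks the trainable count for a single weight matrix. They are not reported settings of MedCT-VLM.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a (d_out x d_in) weight.

    LoRA learns an update B @ A, with A of shape (rank, d_in) and
    B of shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Figures quoted in the abstract.
TOTAL_PARAMS = 440e6
TRAINABLE_PARAMS = 1.67e6
trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS

# Hypothetical attention projection (assumption: 768 x 768, rank 8).
D_MODEL = 768
RANK = 8
full_layer_params = D_MODEL * D_MODEL
adapter_params = lora_param_count(D_MODEL, D_MODEL, RANK)

if __name__ == "__main__":
    print(f"trainable share of the model: {trainable_pct:.2f}%")
    print(f"full projection: {full_layer_params} params")
    print(f"rank-{RANK} adapter: {adapter_params} params")
```

Under these assumed dimensions, one adapter holds 48 times fewer parameters than the projection it modifies, which is the mechanism behind the small trainable share reported for the whole model.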

Abstract

Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains, yet their application to volumetric medical imaging remains limited. We introduce MedCT-VLM (Medical CT Vision-Language Model), a parameter-efficient vision-language framework designed to adapt large-scale CT foundation models for downstream clinical tasks. MedCT-VLM uses Low-Rank Adaptation (LoRA) to adapt CT-CLIP, a contrastive vision-language model trained on 25,692 chest CT volumes, for multi-label pathology classification. Rather than fine-tuning the model's 440M parameters directly, we insert low-rank decomposition matrices into the attention layers of both the vision and text encoders, training only 1.67M parameters (0.38% of the total). We evaluate zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training. LoRA fine-tuning improves mean AUROC from 61.3% to 68.9% (+7.6 pp), accuracy from 67.2% to 73.6% (+6.4 pp), and macro-F1 from 32.1% to 36.9% (+4.8 pp). These results demonstrate that parameter-efficient methods can effectively transfer large-scale pretraining to downstream medical imaging tasks, particularly in zero-shot scenarios where labeled data is scarce.
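To make the zero-shot setting concrete, here is a minimal sketch of how a CT embedding might be scored against text prompts for one pathology. The abstract states only that CT embeddings are aligned with text prompts at inference. The paired "present" and "absent" prompts, the temperature of 0.07, and the toy two-dimensional embeddings below are assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probability(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Probability that a finding is present, from a two-way softmax.

    Assumption: each pathology has one "present" and one "absent" prompt,
    and the scaled similarities to both are normalised with a softmax.
    """
    s_pos = cosine(image_emb, pos_text_emb) / temperature
    s_neg = cosine(image_emb, neg_text_emb) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


# Toy embeddings (hypothetical): the scan lies close to the "present" prompt.
ct_embedding = [1.0, 0.0]
present_prompt = [1.0, 0.0]
absent_prompt = [0.0, 1.0]

p_present = zero_shot_probability(ct_embedding, present_prompt, absent_prompt)
p_tied = zero_shot_probability(ct_embedding, present_prompt, present_prompt)

if __name__ == "__main__":
    print(f"P(present) = {p_present:.4f}")
```

Repeating this scoring independently for each of the 18 pathologies gives a multi-label prediction without training a task-specific classifier head, which is why adapter quality shows up directly in AUROC and F1.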

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Dataset class distribution across 18 thoracic pathologies.
  • Figure 2: Random samples of CT slices before and after augmentation.
  • Figure 3: High-level architecture of the proposed model.
  • Figure 4: Overall zero-shot metrics for Base model vs. MedCT-VLM.
  • Figure 5: Overall metrics (radar): proposed model outperforms baseline across accuracy, F1 variants, samples-F1, and mean AUROC.
  • ...and 1 more figures