On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?

Raza Imam, Rufael Marew, Mohammad Yaqub

TL;DR

The paper addresses the vulnerability of Medical Vision-Language Models (MVLMs) to realistic imaging corruptions. It introduces MediMeta-C and combines it with MedMNIST-C to benchmark robustness across multiple modalities. It also proposes RobustMedCLIP, a parameter-efficient, few-shot, LoRA-based adaptation that improves resilience while preserving cross-modality generalization. Across five MVLMs and five imaging modalities, the study reveals significant robustness gaps in existing models and shows that targeted low-rank adaptation with diverse modality exposure can substantially improve reliability under corruption. The resulting framework enables standardized robustness evaluation ahead of clinical deployment and points model design toward diverse, corruption-aware training strategies.

Abstract

Medical Vision-Language Models (MVLMs) have achieved excellent generalization in medical image analysis, yet their performance under noisy, corrupted conditions remains largely untested. Clinical imaging is inherently susceptible to acquisition artifacts and noise; however, existing evaluations predominantly use largely clean datasets, overlooking robustness, i.e., the model's ability to perform under real-world distortions. To address this gap, we first introduce MediMeta-C, a corruption benchmark that systematically applies several perturbations across multiple medical imaging datasets. Combined with MedMNIST-C, this establishes a comprehensive robustness evaluation framework for MVLMs. We further propose RobustMedCLIP, a visual encoder adaptation of a pretrained MVLM that incorporates few-shot tuning to enhance resilience against corruptions. Through extensive experiments, we benchmark 5 major MVLMs across 5 medical imaging modalities, revealing that existing models exhibit severe degradation under corruption and struggle with domain-modality tradeoffs. Our findings highlight the necessity of diverse training and robust adaptation strategies, demonstrating that efficient low-rank adaptation, when paired with few-shot tuning, improves robustness while preserving generalization across modalities.
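The abstract credits low-rank adaptation for the robustness gains. The following is a minimal NumPy sketch of the general LoRA idea, not the authors' implementation: the layer sizes, the rank, the scaling factor `alpha`, and the function name `lora_forward` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
d_out, d_in, rank, alpha = 8, 8, 2, 4.0

W = rng.normal(size=(d_out, d_in))              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen projection plus a scaled low-rank update: x W^T + (alpha/rank) x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))                  # a batch of 3 feature vectors
y_base = x @ W.T                                # output of the frozen layer alone
y_lora = lora_forward(x, W, A, B, alpha, rank)  # output with the LoRA branch attached

trainable = A.size + B.size                     # parameters updated during adaptation
total = W.size                                  # parameters in the frozen weight
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output exactly, and only `A` and `B` would be updated during few-shot tuning. In this toy setting that is 32 trainable values against 64 frozen ones; the saving grows quickly as the layer dimensions increase relative to the rank.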

Paper Structure

This paper contains 10 sections, 8 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Corrupted samples from our MediMeta-C dataset. The y-axis shows dataset names by modality and the x-axis displays corruption types at a fixed severity level.
  • Figure 2: Comparison of average DCT frequency distributions across datasets. Medical images generally exhibit a higher density of low-frequency content than natural images, which in turn carry more high-frequency content [xu2019systematic]. Of the two benchmarks, MediMeta-C (a) shows this property more clearly than MedMNIST-C (b).
  • Figure 3: t-SNE visualization of the clean and corrupted feature distributions, showing how the introduced corruptions shift the distribution at the latent level. MediMeta-C's corrupted features differ notably from MediMeta's clean features. An RN50 backbone is used to extract the features.
  • Figure 4: Benchmarking protocol used in our evaluation, where clean samples represent in-distribution data seen by RobustMedCLIP ($\mathbb{R}$MC), while corrupted samples correspond to out-of-distribution shifts. Sampling refers to selecting the test set from each dataset.
  • Figure 5: A) Few-shot samples from each modality are drawn from the clean training set to adapt the LoRA-augmented image encoder of the pretrained BioMedCLIP. B) Low-rank attention matrices within the image encoder are updated using Eq. \ref{eq:loss_ft}, enabling the model to learn from diverse in-distribution modalities while retaining pretrained knowledge. C) Pixel-level density distributions comparing Clean and Corrupted samples under (a) brightness and (b) contrast corruptions, highlighting input-level distributional shifts. D) (a) Top-1 Accuracy as a measure of generalization, and (b) mean Corruption Error (mCE) as a proxy for robustness, averaged over MediMeta-C and MedMNIST-C. All values are normalized for visual comparability across models.
  • ...and 10 more figures
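
Figure 5 (D) uses mean Corruption Error (mCE) as a proxy for robustness. The sketch below shows one common way to compute it, following the ImageNet-C convention of normalising a model's summed error over severity levels by a baseline model's summed error. This summary does not state the paper's exact normalisation, so treat the formula as an assumption; the function names and all error rates are made up for illustration.

```python
import numpy as np


def corruption_error(model_err, baseline_err):
    """CE for one corruption: error summed over severities, divided by the baseline's sum."""
    return float(np.sum(model_err) / np.sum(baseline_err))


def mean_corruption_error(model_errs, baseline_errs):
    """mCE: average of the per-corruption CE values."""
    ces = [corruption_error(model_errs[c], baseline_errs[c]) for c in model_errs]
    return float(np.mean(ces))


# Made-up error rates (1 - accuracy) at three severity levels per corruption.
model_errs = {
    "brightness": [0.10, 0.20, 0.30],
    "contrast": [0.20, 0.30, 0.40],
}
baseline_errs = {
    "brightness": [0.20, 0.40, 0.60],
    "contrast": [0.30, 0.45, 0.60],
}

mce = mean_corruption_error(model_errs, baseline_errs)
print(f"mCE = {mce:.4f}")
```

Under this convention a value below 1 means the model degrades less than the baseline across corruptions, and lower is better.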