Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights

Yuan Zhong, Ruinan Jin, Qi Dou, Xiaoxiao Li

TL;DR

Efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance to specialist medical VLMs in most tasks, particularly when transferring to unseen or rare out-of-distribution (OOD) medical modalities.

Abstract

Vision Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing specialist medical VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and specialist medical VLMs each perform best. This study highlights the complementary strengths of specialist medical and generalist VLMs. Specialists remain valuable in modality-aligned use cases, but we find that efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance in most tasks, particularly when transferring to unseen or rare OOD medical modalities. These results suggest that generalist VLMs, rather than being constrained by their lack of specialist medical pretraining, may offer a scalable and cost-effective pathway for advancing clinical AI development.

Paper Structure

This paper contains 21 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An overview of MedVLMBench. a. Distribution of the benchmarking corpus across two tasks (i.e., diagnosis and VQA), four imaging modalities, and ten datasets. Percentages indicate each dataset’s share of the total samples. b. Evaluation settings and definition of RQ gaps. Our study includes both off-the-shelf VLMs and task-specific VLMs obtained via parameter-efficient fine-tuning (PEFT) on individual datasets. We also consider both in-distribution (ID) and out-of-distribution (OOD) imaging modalities for different RQ analyses. The definition of ID and OOD can be found in Sec. \ref{sec:datasets}. c. The architectures of contrastive VLMs employed for disease diagnosis tasks, along with the two fine-tuning strategies incorporated therein: low-rank adaptation (LoRA) and linear probing. d. The architecture of generative VLMs used for medical VQA tasks and the LoRA-based SFT fine-tuning strategy.
  • Figure 2: Benchmarking results of disease diagnosis and VQA tasks. a. Disease diagnosis with contrastive VLMs. Six CLIP-family models and two SigLIP-family models are included. For each dataset, we report the OTS and LP performance (metric: AUROC). b. VQA with generative VLMs. Multiple open-source VLMs from diverse families and two proprietary commercial VLMs (o3 and Gemini 2.5 Pro) are included. For each dataset, we report OTS and SFT performance (metric: overall GPT score). In a and b, we use blue to denote generalist VLMs and green to denote medical VLMs. The dagger ($\mathbf{\dagger}$) marks datasets whose imaging modality was seen during the corresponding model’s pre-training, i.e., ID for that VLM. All estimates are accompanied by uncertainty computed via nonparametric bootstrapping with 1,000 replicates.
  • Figure 3: Comparison between generalist VLMs and their counterpart specialist medical VLMs. a. Disease diagnosis. Metric: AUROC. b. VQA. Metric: overall GPT score. Within each model family, RQ gaps are computed using the best-performing generalist and specialist medical VLMs. A negative RQ1 gap (green line) reflects a specialist advantage in the OTS setting; a positive RQ2 gap (blue line) indicates that generalists surpass specialists after lightweight adaptation; and a positive RQ3 gap (pink line) denotes superior generalization of generalists on OOD tasks. All estimates are accompanied by uncertainty computed via nonparametric bootstrapping with 1,000 replicates.
  • Figure 4: Comparison between generalist VLMs and the best-performing specialist medical VLMs. a. Disease diagnosis. Metric: AUROC. b. VQA. Metric: overall GPT Score. The bars indicate the performance of the generalist VLMs. The dark and light dashed lines denote the performance of the best and second-best specialist VLMs, respectively. The OTS and fine-tuned performances for those VLMs are distinguished by green and purple, respectively. For medical VLMs, we report OTS performance on ID data and fine-tuned performance on OOD data. Note that when computing the RQ gap, we only consider OTS medical models on ID data and fine-tuned medical models on OOD data. If a certain dataset has no ID or OOD medical models, the corresponding dashed line will be absent. The error bars show 95% confidence intervals. All estimates are accompanied by uncertainty computed via nonparametric bootstrapping with 1,000 replicates.
  • Figure 5: Benchmarking results of VQA tasks with overall F1 score. a. VQA results with generative VLMs. b. Family-wise RQ gap results for disease diagnosis and VQA. The metric in the figure is the overall tokenized F1 score. All estimates are accompanied by uncertainty computed via nonparametric bootstrapping with 1,000 replicates.
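
The captions above repeatedly mention a tokenized F1 score, RQ gaps between generalist and specialist models, and nonparametric bootstrapping with 1,000 replicates. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. It rests on several assumptions: the tokenizer is simple lowercased whitespace splitting, the gap is taken as the generalist mean minus the specialist mean (which matches the sign convention in the Figure 3 caption), the bootstrap is paired over samples, and the interval is a percentile interval. The function names `token_f1` and `bootstrap_gap` are hypothetical.

```python
import random
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings.

    Assumption: tokens are lowercased whitespace-separated words; the
    benchmark's actual tokenizer may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def bootstrap_gap(generalist_scores, specialist_scores,
                  n_replicates=1000, alpha=0.05, seed=0):
    """Paired nonparametric bootstrap of mean(generalist) - mean(specialist).

    Returns (point_estimate, lower, upper), where lower and upper bound a
    percentile interval at level 1 - alpha. Both score lists must refer to
    the same test samples in the same order (an assumption of this sketch).
    """
    if len(generalist_scores) != len(specialist_scores):
        raise ValueError("score lists must have the same length")
    n = len(generalist_scores)
    if n == 0:
        raise ValueError("score lists must not be empty")
    rng = random.Random(seed)
    diffs = [g - s for g, s in zip(generalist_scores, specialist_scores)]
    point = sum(diffs) / n
    replicates = []
    for _ in range(n_replicates):
        # Resample test cases with replacement, keeping model pairs together.
        resampled = [diffs[rng.randrange(n)] for _ in range(n)]
        replicates.append(sum(resampled) / n)
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_replicates)]
    upper = replicates[min(n_replicates - 1,
                           int((1 - alpha / 2) * n_replicates))]
    return point, lower, upper
```

Under these assumptions, a positive point estimate whose interval excludes zero would correspond to a generalist advantage, and a negative one to a specialist advantage. For a metric such as AUROC, which is not a per-sample average, the same resampling loop would apply, but the metric would need to be recomputed on each resampled set rather than averaged from per-sample differences.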