Few-shot Adaptation of Medical Vision-Language Models

Fereshteh Shakeri, Yunshi Huang, Julio Silva-Rodríguez, Houda Bahig, An Tang, Jose Dolz, Ismail Ben Ayed

TL;DR

The paper studies few-shot adaptation of medical vision-language models (VLMs), motivated by data scarcity and privacy constraints in clinical contexts. It introduces a structured few-shot benchmark for adapting medical VLMs across three modalities (histology, ophthalmology, radiology) and nine downstream tasks, using up to 16 labeled samples per class. The authors evaluate LP+text, a simple text-informed linear probe that learns class-wise multipliers to blend visual prototypes with text embeddings. It reaches competitive or superior accuracy at much lower computational cost and is compatible with black-box deployment. In the reported experiments, LP+text outperforms prompt-learning baselines and matches or surpasses black-box adapters while remaining efficient. The benchmark and code are publicly available to spur progress in few-shot medical multimodal learning.

Abstract

Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following the strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performance in comparison to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We make our benchmark and code publicly available to trigger further developments in this emergent subject: https://github.com/FereshteShakeri/few-shot-MedVLMs.
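
The blending of visual prototypes and text embeddings described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the logit for class k takes the form f · (w_k + α_k t_k), where f is a frozen, L2-normalized image feature, w_k a learnable visual prototype, t_k a frozen text embedding, and α_k a learnable class-wise multiplier. The function names (`fit_lp_text`, `predict`), the class-mean initialization, the learning rate, and the plain gradient-descent loop are all illustrative assumptions. Only pre-computed features are required, which is what makes such a probe compatible with the black-box setting.

```python
import numpy as np


def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_lp_text(feats, labels, text_emb, lr=0.5, steps=200):
    """Hypothetical text-informed linear probe (sketch, not the paper's code).

    feats:    (N, D) frozen, L2-normalized image features of the few-shot set
    labels:   (N,)   integer class labels in [0, K)
    text_emb: (K, D) frozen, L2-normalized class text embeddings
    Returns visual prototypes w (K, D) and class-wise multipliers alpha (K,).
    """
    n = feats.shape[0]
    k = text_emb.shape[0]
    onehot = np.eye(k)[labels]
    counts = np.maximum(onehot.sum(axis=0), 1.0)
    # Assumption: prototypes start at the class means of the few-shot features.
    w = (onehot.T @ feats) / counts[:, None]
    # Assumption: every multiplier starts at 1.
    alpha = np.ones(k)
    sims = feats @ text_emb.T  # (N, K) image-text similarities, fixed during training
    for _ in range(steps):
        logits = feats @ w.T + alpha[None, :] * sims
        grad_logits = (softmax(logits) - onehot) / n  # gradient of mean cross-entropy
        grad_w = grad_logits.T @ feats                # (K, D)
        grad_alpha = (grad_logits * sims).sum(axis=0) # (K,)
        w -= lr * grad_w
        alpha -= lr * grad_alpha
    return w, alpha


def predict(feats, w, alpha, text_emb):
    # Class scores from the blended classifier w_k + alpha_k * t_k.
    logits = feats @ (w + alpha[:, None] * text_emb).T
    return logits.argmax(axis=1)
```

With α_k fixed to zero, the sketch reduces to an ordinary linear probe initialized at the class means; with w_k fixed to zero and α_k equal across classes, it reduces to zero-shot prediction from the text embeddings.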

Paper Structure

This paper contains 6 sections, 2 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Comparison of different adaptation methods of Medical VLMs evaluated on 9 benchmarks, averaged over 5 tasks.