BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

TL;DR

BiomedCoOp tackles the challenge of adapting biomedical vision-language systems under data scarcity by shifting from full fine-tuning to prompt learning. It combines Semantic Consistency by Contextual Mapping with Knowledge Distillation via Selective Prompting to learn robust, GPT-derived prompts for BiomedCLIP, enabling strong few-shot and base-to-novel generalization. The approach is validated on 11 datasets across 9 modalities and 10 organs, showing substantial accuracy gains over state-of-the-art prompt-learning methods and improved interpretability through context analysis and saliency maps. This work demonstrates a data-efficient, generalizable path toward deploying biomedical vision-language capabilities across diverse clinical imaging tasks.

Abstract

Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp), intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code is publicly available at https://github.com/HealthX-Lab/BiomedCoOp.
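
To make the training objective described in the abstract more concrete, the following is a minimal sketch of how a classification loss, a semantic-consistency term, and a distillation term could be combined. It is not the authors' implementation. The function names, the use of mean squared distance for semantic consistency, the use of KL divergence for distillation, the single-class simplification, and the weights lambda_sc and lambda_kd are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the logits against an integer class label."""
    return -math.log(softmax(logits)[label])


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (an averaged prompt ensemble)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a, b):
    """Mean squared difference between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(student_logits, teacher_logits, label,
                  learned_text_feat, llm_text_feats,
                  lambda_sc=1.0, lambda_kd=1.0):
    """Hypothetical three-term objective (all choices are assumptions).

    student_logits: logits obtained with the learnable prompt context
    teacher_logits: logits obtained with the selected LLM-generated prompts
    learned_text_feat: text feature of the learned prompt for one class
    llm_text_feats: text features of several LLM prompts for the same class
    """
    ce = cross_entropy(student_logits, label)
    sc = squared_distance(learned_text_feat, mean_vector(llm_text_feats))
    kd = kl_divergence(teacher_logits, student_logits)
    return ce + lambda_sc * sc + lambda_kd * kd
```

In this sketch, when the learned text feature equals the averaged LLM prompt feature and the student logits match the teacher logits, the two regularizers vanish and only the cross-entropy term remains. In practice the features and logits would come from BiomedCLIP's text and image encoders rather than hand-written lists.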

Paper Structure

This paper contains 30 sections, 11 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Overview of the BiomedCoOp framework, which combines LLM queries, learnable context tokens, and BiomedCLIP to generate multi-modal representations for biomedical tasks. The method integrates text and image features using prompt ensembling strategies, minimizes cross-entropy and semantic differences, and aligns teacher-student logits, enabling effective few-shot learning for novel biomedical categories.
  • Figure 2: Bar plot comparing the classification accuracies (%) of different CLIP-based backbone models in BiomedCoOp across various few-shot settings.
  • Figure 3: Effect of various text prompt techniques on visual saliency maps. Columns (b)-(f) represent different prompt methods.
  • Figure S1: Average classification accuracy (%) of various few-shot adaptation methods across different numbers of training shots per class.
  • Figure S2: Effect of the selection threshold ($\zeta_s$) on base-to-novel generalization.