MedFocusCLIP : Improving few shot classification in medical datasets using pixel wise attention

Aadya Arora; Vinay Namboodiri

TL;DR

MedFocusCLIP tackles few-shot, fine-grained medical image classification by steering CLIP's visual encoder with SAM2 segmentation prompts to focus on ROIs. The method integrates text embeddings with ROI-aware image features via a multimodal fusion module and optimizes a joint objective $L = \lambda L_c + (1 - \lambda) L_e$ to align modalities while maximizing accuracy. Empirical results on COVID, lung disease, brain tumor, and breast cancer datasets show substantial gains over pretrained CLIP in low-data regimes and provide interpretable localization via segmentation masks. Ablation studies confirm the importance of SAM2 over SAM Adapter and CLIP over Swin Transformer in this setup. The work highlights the potential of combining segmentation prompts with vision-language models to improve data-efficient medical image analysis.
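The joint objective $L = \lambda L_c + (1 - \lambda) L_e$ can be illustrated with a small sketch. The summary does not define the two terms, so the code below assumes that $L_c$ is a CLIP-style symmetric contrastive loss over an image-text similarity matrix and that $L_e$ is a cross-entropy classification loss. The function names and the default `lam=0.5` are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target index in each row."""
    total = 0.0
    for row, target in zip(logits, targets):
        total -= log_softmax(row)[target]
    return total / len(logits)


def contrastive_loss(sim):
    """Symmetric contrastive loss over a square similarity matrix.

    Assumption: sim[i][j] scores image i against text j, and the
    matched pairs lie on the diagonal (as in CLIP pretraining).
    """
    n = len(sim)
    diagonal = list(range(n))
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (cross_entropy(sim, diagonal) + cross_entropy(sim_t, diagonal))


def joint_loss(sim, class_logits, labels, lam=0.5):
    """Weighted sum lam * L_c + (1 - lam) * L_e (term meanings assumed)."""
    l_c = contrastive_loss(sim)
    l_e = cross_entropy(class_logits, labels)
    return lam * l_c + (1.0 - lam) * l_e
```

Setting `lam=1.0` reduces the objective to the alignment term alone, and `lam=0.0` to the classification term alone, which makes the role of $\lambda$ as a trade-off weight explicit.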

Abstract

With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach for adapting pretrained models to downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn an additional prompt to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where inter-class variance is small and intra-class variance is large. Hence, in this paper we propose to use the advanced segmentation capabilities of Segment Anything Model 2 (SAM2) as a visual prompting cue for the visual encoder of CLIP (Contrastive Language-Image Pretraining), guiding the encoder's attention to the relevant regions of the image. This helps the model focus on highly discriminative regions without being distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets including X-rays, CT scans, and MRI images, and report accuracies of (71%, 81%, 86%, 58%) for the proposed approach on the (COVID, lung-disease, brain-tumor, breast-cancer) datasets, against (66%, 70%, 68%, 29%) for a pretrained CLIP model after few-shot training. The proposed approach also yields interpretable explanations of the classification results through the localization obtained from segmentation.
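The idea of steering the encoder toward segmented regions can be sketched as a pixel-wise reweighting of the input. This is an illustrative assumption only: the abstract does not specify how the SAM2 mask is injected into CLIP, so the code below simply down-weights pixels outside a binary region-of-interest mask before the image would be passed to an encoder. The function name and the `background_weight` parameter are hypothetical.

```python
def apply_roi_mask(image, mask, background_weight=0.0):
    """Scale pixels outside the region of interest by background_weight.

    image: 2D list of pixel intensities.
    mask:  2D list of the same shape; truthy entries mark the ROI
           (for example, a binarized segmentation mask).
    """
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row) for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [px * (1.0 if inside else background_weight)
         for px, inside in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

A `background_weight` of `0.0` removes the background entirely, while a value between 0 and 1 keeps some context and only de-emphasizes it. Which of these, if either, matches the paper's mechanism is not stated in this summary.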

Paper Structure (12 sections, 3 figures, 4 tables)

This paper contains 12 sections, 3 figures, 4 tables.

Introduction
Related Works
SAM in Medical Image Segmentation
CLIP
Proposed System
Overview
Methodology
Results And Discussions
Ablation Study
Impact of Using SAM Adapter Instead of SAM2
Replacing CLIP with SWIN Transformer
Conclusion

Figures (3)

Figure 1: This image illustrates the proposed MedFocusCLIP framework, combining zero-shot SAM2 segmentation with few-shot trained CLIP for medical image analysis. SAM2 first segments the original image, which is then processed by CLIP's image encoder utilizing a Vision Transformer (ViT) backbone. This ViT-based encoder extracts rich visual features that are aligned with text embeddings, enabling efficient learning and classification from limited medical imaging data. The ViT architecture in the image encoder allows for better handling of global context in medical images, potentially improving the model's ability to detect subtle abnormalities.
Figure 2: Performance comparison across different architectures and dataset sizes for different datasets.
Figure 3: Sample outputs from various datasets showing regions of interest generated from SAM2.
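The Figure 1 caption describes visual features being aligned with text embeddings for classification. A minimal sketch of that final step, assuming the standard CLIP-style rule of picking the class whose text embedding has the highest cosine similarity to the image feature, is shown below. The feature vectors and label names are made up for illustration, and a real pipeline would obtain them from the image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_feature, text_embeddings):
    """Return the best-matching label and all similarity scores.

    text_embeddings: dict mapping a class label to its text embedding.
    """
    scores = {
        label: cosine(image_feature, embedding)
        for label, embedding in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical 2-D embeddings, for illustration only.
label, scores = classify(
    [1.0, 0.0],
    {"covid": [0.9, 0.1], "normal": [0.1, 0.9]},
)
```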
