MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

TL;DR

MedCLIP-SAM introduces a three-stage framework that combines BiomedCLIP with the Segment Anything Model (SAM) to enable zero-shot and weakly supervised medical image segmentation guided by text prompts. A new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss is used to fine-tune BiomedCLIP, while gScoreCAM provides saliency-based prompts that drive SAM to produce segmentation masks; a weakly supervised Residual UNet then refines these masks. Validation on breast tumor ultrasound, brain tumor MRI, and lung X-ray tasks shows strong cross-modal retrieval performance and notable segmentation gains, with several zero-shot results approaching or surpassing some fully supervised baselines. The work presents data-efficient, interactive segmentation as a viable path toward clinical deployment and points to further integration with MedSAM and 3D radiologic segmentation.

Abstract

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactability. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models, such as CLIP and the Segment-Anything-Model (SAM), with comprehensive cross-domain representation has opened the door for interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation remains limited, although it is highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM, that combines CLIP and SAM models to generate segmentations of clinical scans using text prompts in both zero-shot and weakly supervised settings. To achieve this, we employed a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model, and the recent gScoreCAM to generate prompts for obtaining segmentation masks from SAM in a zero-shot setting. Additionally, we explored the use of zero-shot segmentation labels in a weakly supervised paradigm to further improve the segmentation quality. In extensive tests on three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework demonstrated excellent accuracy. Code is available at https://github.com/HealthX-Lab/MedCLIP-SAM.
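
The prompt-generation step mentioned in the abstract can be illustrated with a toy sketch. It assumes the saliency map is a 2D grid of non-negative scores and that the prompt is a bounding box obtained by thresholding at a fraction of the peak response. The function name `saliency_to_box`, the 0.5 threshold, and the box convention are illustrative assumptions, not details taken from the paper, and the actual gScoreCAM and SAM calls are omitted.

```python
from typing import List, Optional, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max), inclusive pixel indices.
Box = Tuple[int, int, int, int]


def saliency_to_box(saliency: List[List[float]],
                    rel_threshold: float = 0.5) -> Optional[Box]:
    """Return the tight box around pixels whose score is at least
    rel_threshold * peak, or None if the map has no positive response."""
    peak = max((v for row in saliency for v in row), default=0.0)
    if peak <= 0.0:
        return None
    cutoff = rel_threshold * peak
    # Collect column (x) and row (y) indices of all pixels above the cutoff.
    xs = [x for row in saliency for x, v in enumerate(row) if v >= cutoff]
    ys = [y for y, row in enumerate(saliency) for v in row if v >= cutoff]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 4x4 saliency map with a bright 2x2 region in the middle.
toy_map = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.1, 0.6, 1.0, 0.2],
    [0.0, 0.0, 0.1, 0.0],
]
print(saliency_to_box(toy_map))  # (1, 1, 2, 2)
```

In a full pipeline, a box like this would be passed to a promptable segmentation model as the region of interest; how the authors post-process the saliency map before prompting SAM is not specified in the abstract.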

Paper Structure (15 sections, 5 equations, 2 figures, 3 tables)