Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero shot Medical Image Segmentation

Sidra Aleem, Fangyijie Wang, Mayug Maniparambil, Eric Arazo, Julia Dietlmeier, Guenole Silvestre, Kathleen Curran, Noel E. O'Connor, Suzanne Little

TL;DR

SaLIP is a training-free cascade of SAM and CLIP for zero-shot medical image segmentation. It first generates exhaustive part masks with SAM in segment-everything mode, then queries CLIP with visually descriptive prompts generated by GPT-3.5 to retrieve the ROI crops, and finally prompts SAM with ROI-derived bounding boxes to obtain the targeted organ segmentation. Across brain MRI, chest X-ray, and fetal head ultrasound datasets, SaLIP markedly improves zero-shot segmentation over unprompted SAM and approaches the upper-bound performance set by ground-truth prompts. The approach demonstrates the practical viability of cross-modal foundation models for data-efficient medical image segmentation and offers a flexible, plug-and-play tool for clinical tasks without labeled prompts.

Abstract

The Segment Anything Model (SAM) and CLIP are remarkable vision foundation models (VFMs). SAM, a prompt-driven segmentation model, excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. However, their unified potential has not yet been explored in medical image segmentation. To adapt SAM to medical imaging, existing methods primarily rely on tuning strategies that require extensive data or on prior prompts tailored to the specific task, which is particularly challenging when only a limited number of data samples are available. This work presents an in-depth exploration of integrating SAM and CLIP into a unified framework for medical image segmentation. Specifically, we propose a simple unified framework, SaLIP, for organ segmentation. Initially, SAM is used for part-based segmentation within the image, followed by CLIP to retrieve the mask corresponding to the region of interest (ROI) from the pool of SAM-generated masks. Finally, SAM is prompted by the retrieved ROI to segment a specific organ. Thus, SaLIP is training- and fine-tuning-free and does not rely on domain expertise or labeled data for prompt engineering. Our method shows substantial enhancements in zero-shot segmentation, with notable improvements in DICE scores across diverse segmentation tasks such as brain (63.46%), lung (50.11%), and fetal head (30.82%) when compared to unprompted SAM. Code and text prompts are available at: https://github.com/aleemsidra/SaLIP.
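
The abstract reports its gains as DICE scores. As a reminder of what that metric measures, the following is a minimal sketch of the standard Dice similarity coefficient, $\mathrm{DSC} = 2|P \cap G| / (|P| + |G|)$, for a predicted mask $P$ and a ground-truth mask $G$. It is not taken from the SaLIP repository: the function name `dice_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, and plain nested lists stand in for image arrays.

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks of equal size.

    Masks are nested lists of 0/1 values (rows of pixels).
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # |P ∩ G|: pixels that are foreground in both masks.
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    # |P| + |G|: total foreground pixels in each mask.
    total = sum(flat_pred) + sum(flat_target)

    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 foreground pixels in each mask, 2 of them shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1 means the two masks coincide exactly and 0 means they share no foreground pixels.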

Paper Structure (28 sections, 5 equations, 11 figures, 6 tables)

This paper contains 28 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: SAM efficiently segments regions from prompts such as boxes or points. However, generating such prompts requires domain expertise or annotated data, which are not readily available in medical imaging. To overcome this challenge, we use the segment-everything mode to obtain a mask for every part of the image. Then, using CLIP, we select the mask corresponding to the specific organ and use it to generate prompts.
  • Figure 2: Illustration of SaLIP: $\mathit{SAM}_{\mathrm{EM}}$ segments the input image $I$ using a grid-wise set of keypoints $G$ as prompts to produce part-based segmentation masks $M$. To remove masks $m \in M$ that correspond to background, a threshold $A_{\mathit{threshold}}$ is applied to $M$. Subsequently, $I$ is cropped based on $M_{\mathit{filtered}}$ to generate a set of crops $C$, which are fed into CLIP along with visually descriptive sentences $T$ generated by GPT-3.5. The region-of-interest crops $C_{\mathrm{ROI}}$ are retrieved from CLIP using argmax, and the top-$k$ crops (two in this case) are selected as the ROI. The extracted ROI is used to generate bounding-box prompts (coordinates shown in red), which prompt $\mathit{SAM}_{\mathrm{PSM}}$ to obtain the specific organ segmentation $S_{\mathrm{org}}$ within $I$.
  • Figure 3: Comparison of GT-SAM (the upper bound), unprompted SAM, and SaLIP (ours). The text in yellow gives the DSC obtained with each method.
  • Figure 4: SAM failure cases. First row: $\mathit{SAM}_{\mathrm{EM}}$ fails to generate a mask for the fetal head, resulting in misclassification by CLIP. Second row: $\mathit{SAM}_{\mathrm{EM}}$ generates a mask for the right lung but fails to generate one for the left lung, leading CLIP to retrieve the wrong crop.
  • Figure 5: CLIP failure cases. $\mathit{SAM}_{\mathrm{EM}}$ generates multiple masks for both ROIs (left and right lung). First row: CLIP, while correctly recognizing the left lung, identifies a second mask for the same lung region and fails to retrieve the one for the right lung. Second row: CLIP does not retrieve the left lung crop.
  • ...and 6 more figures
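
The intermediate steps described in the Figure 2 caption (thresholding the mask pool, selecting the top-k crops, and deriving bounding-box prompts) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that background masks are the large ones, so the threshold is applied as a maximum area fraction; the similarity scores are made-up numbers standing in for CLIP image-text similarities; and every function name here (`filter_masks`, `top_k_indices`, `mask_to_bbox`) is hypothetical. SAM and CLIP themselves are not called, and nested lists stand in for image arrays.

```python
def mask_area(mask):
    """Number of foreground pixels in a binary mask (nested lists of 0/1)."""
    return sum(1 for row in mask for v in row if v)


def filter_masks(masks, max_area_fraction):
    """Keep masks covering at most `max_area_fraction` of the image.

    Assumption: masks above the threshold are treated as background.
    """
    kept = []
    for mask in masks:
        total_pixels = len(mask) * len(mask[0])
        if mask_area(mask) / total_pixels <= max_area_fraction:
            kept.append(mask)
    return kept


def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def mask_to_bbox(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) around a mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("cannot compute a bounding box for an empty mask")
    return (min(xs), min(ys), max(xs), max(ys))


# Toy mask pool on a 3x4 image: one large background mask and three parts.
background = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
left_part = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
right_part = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
speck = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

filtered = filter_masks([background, left_part, right_part, speck], 0.5)

# Hypothetical similarity scores, one per filtered mask's crop.
scores = [0.31, 0.27, 0.05]
roi_indices = top_k_indices(scores, k=2)
box_prompts = [mask_to_bbox(filtered[i]) for i in roi_indices]
```

In this toy run the background mask is discarded, the two highest-scoring parts are kept as the ROI, and each yields one box that would then be passed to SAM as a prompt.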