
SAM-Mamba: Mamba Guided SAM Architecture for Generalized Zero-Shot Polyp Segmentation

Tapas Kumar Dutta, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha

TL;DR

Colorectal polyp segmentation is hampered by variable polyp morphology and ambiguous boundaries, which limit traditional CNN/ViT approaches. The authors propose SAM-Mamba, which freezes the SAM image encoder and adds a Mamba-Prior module (Multi-scale Spatial Decomposition, Channel Saliency and Context Accumulation, and Mamba Channel Interaction) plus adapters to inject polyp-domain cues and improve transferability. Training uses a two-stage objective, $\mathcal{L}_D = \mathcal{L}_w^{\text{Dice}} + \mathcal{L}_w^{\text{BCE}}$, the sum of weighted Dice and weighted binary cross-entropy (BCE) losses. Across five benchmark datasets, SAM-Mamba achieves state-of-the-art or competitive results, with strong zero-shot generalization to unseen datasets, pointing to practical potential for real-time clinical use. The work highlights how domain-aware priors and long-range dependency modeling can substantially improve generalized segmentation when building on a powerful foundation model like SAM.
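The TL;DR names the two loss terms but not how the per-pixel weights are computed. The sketch below assumes the boundary-aware weighting popularized by PraNet, $w = 1 + 5\,\lvert \mathrm{avgpool}_{31}(G) - G \rvert$ for a ground-truth mask $G$, which emphasizes pixels near the polyp boundary; PraNet pairs these weights with a weighted IoU term, whereas here they are applied to a soft Dice term to match $\mathcal{L}_w^{\text{Dice}}$. Whether SAM-Mamba uses exactly this weighting, kernel size, or scale is an assumption, and all function names are hypothetical.

```python
import numpy as np


def box_mean(mask: np.ndarray, k: int = 31) -> np.ndarray:
    """k x k mean filter (k odd) with zero padding, like avg_pool2d(stride=1, padding=k // 2)."""
    pad = k // 2
    padded = np.pad(mask.astype(float), pad, mode="constant")
    # Summed-area table with a leading row/column of zeros: sat[i, j] == padded[:i, :j].sum()
    sat = np.pad(padded.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    h, w = mask.shape
    window_sum = (sat[k:k + h, k:k + w] - sat[:h, k:k + w]
                  - sat[k:k + h, :w] + sat[:h, :w])
    return window_sum / (k * k)


def boundary_weights(mask: np.ndarray, k: int = 31, scale: float = 5.0) -> np.ndarray:
    """Assumed PraNet-style weights: 1 in uniform regions, larger near mask boundaries."""
    return 1.0 + scale * np.abs(box_mean(mask, k) - mask)


def weighted_dice_bce(logits: np.ndarray, mask: np.ndarray, eps: float = 1e-6) -> float:
    """Hypothetical L_D = weighted Dice + weighted BCE for one H x W prediction."""
    w = boundary_weights(mask)
    prob = 1.0 / (1.0 + np.exp(-logits))
    # Numerically stable BCE with logits: max(x, 0) - x * y + log(1 + exp(-|x|))
    bce = np.maximum(logits, 0.0) - logits * mask + np.log1p(np.exp(-np.abs(logits)))
    l_bce = (w * bce).sum() / w.sum()
    # Soft Dice on probabilities, using the same per-pixel weights
    inter = (w * prob * mask).sum()
    total = (w * (prob + mask)).sum()
    l_dice = 1.0 - (2.0 * inter + eps) / (total + eps)
    return float(l_dice + l_bce)


mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0                  # toy square "polyp"
good = np.where(mask > 0, 8.0, -8.0)      # confident, correct logits
bad = -good                               # confident, inverted logits
print(weighted_dice_bce(good, mask) < weighted_dice_bce(bad, mask))  # True
```

Confident correct logits give a loss near zero, confidently inverted logits are heavily penalized, and the weighting makes errors along the polyp boundary count more than errors deep in the background.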

Abstract

Polyp segmentation in colonoscopy is crucial for detecting colorectal cancer. However, it is challenging due to variations in the structure, color, and size of polyps, as well as the lack of clear boundaries with surrounding tissues. Traditional segmentation models based on Convolutional Neural Networks (CNNs) struggle to capture detailed patterns and global context, limiting their performance. Vision Transformer (ViT)-based models address some of these issues but have difficulty capturing local context and lack strong zero-shot generalization. To this end, we propose the Mamba-guided Segment Anything Model (SAM-Mamba) for efficient polyp segmentation. Our approach introduces a Mamba-Prior module in the encoder to bridge the gap between the general pre-trained representation of SAM and subtle polyp-relevant cues. It injects salient cues of polyp images into the SAM image encoder as a domain prior while capturing global dependencies at various scales, leading to more accurate segmentation results. Extensive experiments on five benchmark datasets show that SAM-Mamba outperforms traditional CNN, ViT, and Adapter-based models in both quantitative and qualitative measures. Additionally, SAM-Mamba demonstrates excellent adaptability to unseen datasets, making it highly suitable for real-time clinical use.
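The abstract credits the Mamba-Prior module with capturing global dependencies. The internals of Mamba Channel Interaction are not given here, so the sketch below only illustrates the generic mechanism it builds on: a simplified selective state-space scan in the style of Mamba, where the step size and the input/output matrices depend on each token, letting information from every earlier token reach later ones in linear time. Flattening a feature map in raster order, the shapes, and the random weights are illustrative assumptions, and `selective_scan` is a hypothetical name.

```python
import numpy as np


def softplus(x: np.ndarray) -> np.ndarray:
    # Numerically stable log(1 + exp(x))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))


def selective_scan(x, A, W_B, W_C, W_delta):
    """Simplified Mamba-style (S6) selective scan over a token sequence.

    x:       (L, D) tokens, e.g. a feature map flattened in raster order
    A:       (D, N) negative entries, one diagonal state matrix per channel
    W_B/W_C: (D, N) make the input and output matrices depend on each token
    W_delta: (D, D) gives a positive, token-dependent step size per channel
    """
    L, D = x.shape
    h = np.zeros((D, A.shape[1]))             # hidden state carried along the sequence
    y = np.empty((L, D))
    for t in range(L):
        delta = softplus(x[t] @ W_delta)      # (D,) step sizes
        B = x[t] @ W_B                        # (N,)
        C = x[t] @ W_C                        # (N,)
        A_bar = np.exp(delta[:, None] * A)    # zero-order-hold discretization, values in (0, 1)
        h = A_bar * h + delta[:, None] * B[None, :] * x[t][:, None]  # simplified B_bar * x_t
        y[t] = h @ C                          # read out each channel's state
    return y


rng = np.random.default_rng(0)
D, N = 4, 8
A = -np.exp(rng.normal(size=(D, N)))
W_B, W_C = rng.normal(size=(D, N)), rng.normal(size=(D, N))
W_delta = 0.1 * rng.normal(size=(D, D))
feat = rng.normal(size=(6, 6, D))             # toy H x W x D feature map
seq = feat.reshape(-1, D)                     # 36 tokens in raster order
out = selective_scan(seq, A, W_B, W_C, W_delta).reshape(6, 6, D)
print(out.shape)  # (6, 6, 4)
```

Because the state update is recurrent, each output depends only on the current and earlier tokens. Production Mamba implementations replace the Python loop with a hardware-aware parallel scan, and vision variants often scan in several directions to offset this causal ordering.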

Paper Structure

This paper contains 25 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the SAM-Mamba framework for polyp segmentation. The architecture comprises the SAM backbone together with the Mamba-Prior module and Adapter-based fine-tuning to enhance adaptability for polyp segmentation, addressing challenges in SAM such as zero-shot feature transferability, computational cost, and prompt dependency.
  • Figure 2: Qualitative comparison on seen datasets (Kvasir-SEG and CVC-ClinicDB), showcasing the model's ability to accurately segment polyps across diverse sizes, textures, and homogeneous regions.
  • Figure 3: Qualitative comparison on unseen datasets (CVC-300, CVC-ColonDB, and ETIS), highlighting the model's superior generalization capabilities to accurately segment polyps of various sizes, textures, and homogeneous regions.
  • Figure 4: Illustration of the sequence learning progression within the SAM-Mamba model through a set of heatmap visualizations: the input image is followed by a set of encoder-extracted features, the encoder's extracted mask, the decoder-refined features, the refined segmentation mask, and the ground truth.
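Figure 1 mentions Adapter-based fine-tuning on top of SAM. The summary does not specify the adapter design, so the following sketch assumes a standard bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) applied after a frozen block. The 768-wide features match the embedding width of SAM's ViT-B encoder, but the bottleneck size of 64, the toy frozen block, and the class name `BottleneckAdapter` are hypothetical.

```python
import numpy as np


class BottleneckAdapter:
    """Trainable residual bottleneck: x + up(relu(down(x)))."""

    def __init__(self, dim: int, bottleneck: int, rng: np.random.Generator):
        self.W_down = rng.normal(scale=dim ** -0.5, size=(dim, bottleneck))
        # Zero-initialized up-projection: the adapter starts as the identity,
        # so training begins from the frozen backbone's original features.
        self.W_up = np.zeros((bottleneck, dim))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        z = np.maximum(x @ self.W_down, 0.0)  # ReLU here; GELU is also common
        return x + z @ self.W_up

    def num_params(self) -> int:
        return self.W_down.size + self.W_up.size


rng = np.random.default_rng(0)
dim = 768
W_frozen = rng.normal(scale=dim ** -0.5, size=(dim, dim))  # stands in for a frozen SAM block


def frozen_block(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W_frozen)              # never updated during fine-tuning


adapter = BottleneckAdapter(dim, bottleneck=64, rng=rng)
tokens = rng.normal(size=(16, dim))
features = adapter(frozen_block(tokens))
print(adapter.num_params(), W_frozen.size)    # 98304 589824
```

Only the adapter's 98,304 weights would be trained, versus 589,824 in this single frozen projection, which is the parameter-efficiency argument behind adapter fine-tuning of a large frozen encoder.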