SAM-Mamba: Mamba Guided SAM Architecture for Generalized Zero-Shot Polyp Segmentation
Tapas Kumar Dutta, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha
TL;DR
Colorectal polyp segmentation is hampered by variable polyp morphology and ambiguous boundaries, which limit traditional CNN and ViT approaches. The authors propose SAM-Mamba, which freezes the SAM image encoder and adds adapters together with a Mamba-Prior module composed of three parts: Multi-scale Spatial Decomposition, Channel Saliency and Context Accumulation, and Mamba Channel Interaction. These components inject polyp-domain cues into the encoder and improve transferability. The model is trained with the objective $L_D = L_w^{Dice} + L_w^{BCE}$, the sum of a weighted Dice term and a weighted binary cross-entropy (BCE) term. Across five benchmark datasets, SAM-Mamba achieves state-of-the-art or competitive results and generalizes well in the zero-shot setting to unseen datasets, which suggests practical potential for real-time clinical use. The work shows how domain-aware priors and long-range dependency modeling can improve generalized segmentation performance when built on a foundation model such as SAM.
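As a rough illustration of the objective $L_D = L_w^{Dice} + L_w^{BCE}$, the sketch below computes a weighted Dice term and a weighted BCE term over a flat list of pixels and adds them. This summary does not specify how the weights are defined, so the per-pixel `weights` argument, the function name `weighted_dice_bce`, and the smoothing constant `eps` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def weighted_dice_bce(probs, targets, weights, eps=1e-7):
    """Return (dice_loss, bce_loss, total_loss) for flat lists of pixels.

    probs   -- predicted foreground probabilities in [0, 1]
    targets -- ground-truth labels, 0 or 1
    weights -- hypothetical per-pixel weights (assumed, not from the paper)
    """
    # Weighted Dice: 1 - 2 * sum(w * p * t) / sum(w * (p + t))
    inter = sum(w * p * t for p, t, w in zip(probs, targets, weights))
    denom = sum(w * (p + t) for p, t, w in zip(probs, targets, weights))
    dice_loss = 1.0 - (2.0 * inter + eps) / (denom + eps)

    # Weighted BCE: weight-normalized average of per-pixel cross-entropy
    bce_sum = 0.0
    for p, t, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce_sum -= w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    bce_loss = bce_sum / sum(weights)

    # L_D is the plain sum of the two terms
    return dice_loss, bce_loss, dice_loss + bce_loss


if __name__ == "__main__":
    targets = [1, 0, 1, 0]
    uniform = [1.0, 1.0, 1.0, 1.0]  # uniform weights reduce to plain Dice + BCE
    print(weighted_dice_bce([0.9, 0.1, 0.8, 0.2], targets, uniform))
```

With uniform weights the function reduces to the ordinary Dice plus BCE loss; a weighting scheme that emphasizes boundary pixels would be one plausible way to target the boundary ambiguity described above, though that choice is an assumption here.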
Abstract
Polyp segmentation in colonoscopy is crucial for detecting colorectal cancer. However, it is challenging because polyps vary in structure, color, and size and often lack clear boundaries with the surrounding tissue. Traditional segmentation models based on Convolutional Neural Networks (CNNs) struggle to capture detailed patterns and global context, which limits their performance. Vision Transformer (ViT)-based models address some of these issues but have difficulty capturing local context and lack strong zero-shot generalization. To this end, we propose the Mamba-guided Segment Anything Model (SAM-Mamba) for efficient polyp segmentation. Our approach introduces a Mamba-Prior module in the encoder to bridge the gap between the general pre-trained representation of SAM and subtle polyp-relevant cues. The module injects salient cues from polyp images into the SAM image encoder as a domain prior while capturing global dependencies at various scales, leading to more accurate segmentation. Extensive experiments on five benchmark datasets show that SAM-Mamba outperforms traditional CNN, ViT, and Adapter-based models both quantitatively and qualitatively. Additionally, SAM-Mamba adapts well to unseen datasets, making it well suited to real-time clinical use.
