ASPS: Augmented Segment Anything Model for Polyp Segmentation

Huiqian Li, Dingwen Zhang, Jieru Yao, Longfei Han, Zhongyu Li, Junwei Han

TL;DR

This work addresses the domain gap of the Segment Anything Model (SAM) for polyp segmentation in endoscopy by introducing Augmented SAM for Polyp Segmentation (ASPS). ASPS combines Cross-branch Feature Augmentation (CFA), which fuses a trainable CNN encoder with the frozen ViT encoder via cross-branch attention and replaces position embeddings to better capture local details, and Uncertainty-guided Prediction Regularization (UPR), which tunes normalization and uses IoU-based hints to calibrate confidence and reduce uncertainty. The training objective blends a segmentation loss $L_s = L_{ce} + 0.5 L_{dice} + L_{mse}$ with a confidence loss $L_c = -\log(c)$, yielding $\mathcal{L} = L_s + \lambda L_c$, where the image- and pixel-level confidences satisfy $c = \tfrac{1}{2}(c_i + c_p)$ and $c_p = 1 - \frac{1}{H\times W}\sum_{i=1}^H\sum_{j=1}^W U_p$ with $U_p = 1 - \sigma(|\mathbf{P}|)$. Evaluations on five polyp datasets show that ASPS delivers significant gains over SAM-based methods, achieving higher Dice and IoU on several datasets while operating without prompts; code is released for public use. The combined CFA and UPR approach demonstrates strong domain generalization and practical potential for clinical polyp segmentation.
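The objective above can be traced numerically with a small sketch. It is a minimal illustration of the stated formulas only, not the authors' implementation: the function names, the nested-list representation of the logit map $\mathbf{P}$, and the example value of $\lambda$ are assumptions, and the image-level confidence $c_i$ is taken as a given input because its computation is not spelled out here. Since $\sigma(|\mathbf{P}|) \in [0.5, 1)$, the pixel-level uncertainty $U_p$ lies in $(0, 0.5]$ and $c_p$ in $[0.5, 1)$.

```python
import math


def sigmoid(x):
    """Logistic function; only called with x >= 0 here, so exp cannot overflow."""
    return 1.0 / (1.0 + math.exp(-x))


def pixel_confidence(logits):
    """c_p = 1 - mean(U_p), with U_p = 1 - sigmoid(|P|) for each pixel logit P."""
    height = len(logits)
    width = len(logits[0])
    total_uncertainty = sum(1.0 - sigmoid(abs(p)) for row in logits for p in row)
    return 1.0 - total_uncertainty / (height * width)


def confidence_loss(logits, c_i):
    """L_c = -log(c), where c = (c_i + c_p) / 2.

    c_i is the image-level confidence, passed in as a plain float (assumption).
    """
    c = 0.5 * (c_i + pixel_confidence(logits))
    return -math.log(c)


def total_loss(l_ce, l_dice, l_mse, l_c, lam):
    """L = L_s + lambda * L_c, with L_s = L_ce + 0.5 * L_dice + L_mse."""
    l_s = l_ce + 0.5 * l_dice + l_mse
    return l_s + lam * l_c


# Logits near zero are maximally uncertain; large-magnitude logits are confident.
uncertain_map = [[0.0, 0.0], [0.0, 0.0]]
confident_map = [[8.0, -8.0], [-8.0, 8.0]]
loss_uncertain = confidence_loss(uncertain_map, c_i=0.5)
loss_confident = confidence_loss(confident_map, c_i=0.9)
```

With these inputs the uncertain map gives $c_p = 0.5$ and a confidence loss of $\ln 2$, while the confident map gives a smaller loss, which is the behavior the regularizer is meant to reward.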

Abstract

Polyp segmentation plays a pivotal role in colorectal cancer diagnosis. Recently, the emergence of the Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation, leveraging its powerful pre-training capability on large-scale datasets. However, due to the domain gap between natural and endoscopy images, SAM encounters two limitations in achieving effective performance in polyp segmentation. Firstly, its Transformer-based structure prioritizes global and low-frequency information, potentially overlooking local details, and introducing bias into the learned features. Secondly, when applied to endoscopy images, its poor out-of-distribution (OOD) performance results in substandard predictions and biased confidence output. To tackle these challenges, we introduce a novel approach named Augmented SAM for Polyp Segmentation (ASPS), equipped with two modules: Cross-branch Feature Augmentation (CFA) and Uncertainty-guided Prediction Regularization (UPR). CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge while enhancing local features and high-frequency details. Moreover, UPR ingeniously leverages SAM's IoU score to mitigate uncertainty during the training procedure, thereby improving OOD performance and domain generalization. Extensive experimental results demonstrate the effectiveness and utility of the proposed method in improving SAM's performance in polyp segmentation. Our code is available at https://github.com/HuiqianLi/ASPS.
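To make the idea of fusing the two encoder branches concrete, the following is a generic scaled dot-product cross-attention sketch in plain Python. It is not the CFA module itself: the choice of ViT tokens as queries and CNN tokens as keys and values, the absence of learned projections, the residual addition, and the function names are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_branch_attention(vit_tokens, cnn_tokens):
    """Let each ViT token attend over the CNN tokens, then add the result back.

    Both inputs are lists of equal-length feature vectors (hypothetical layout).
    No learned query/key/value projections are applied in this sketch.
    """
    dim = len(vit_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in vit_tokens:
        # Similarity of this ViT token to every CNN token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cnn_tokens]
        weights = softmax(scores)
        # Weighted sum of CNN features (values are the CNN tokens themselves).
        attended = [
            sum(w * value[j] for w, value in zip(weights, cnn_tokens))
            for j in range(dim)
        ]
        # Residual connection keeps the frozen ViT feature and adds local detail.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


vit_features = [[1.0, 0.0], [0.0, 1.0]]
cnn_features = [[0.5, 0.5]]
fused_features = cross_branch_attention(vit_features, cnn_features)
```

The output has one fused vector per ViT token, so the fused features can stand in wherever the original ViT features were used.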

Paper Structure (12 sections, 5 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: An overview of our Augmented Segment Anything Model for polyp segmentation. The Cross-branch Feature Augmentation module is encouraged to learn multi-scale features and multi-level representations. The Uncertainty-guided Prediction Regularization module is designed to minimize the uncertainty of the prediction to improve the domain generalization ability of the model.
  • Figure 2: Detailed architecture of ViT Encoder and Mask Decoder. (a) represents the ViT encoder, while (b) showcases the lightweight decoder of SAM. The CNN feature is derived from the output of the CNN encoder. The yellow and blue modules represent the original SAM structure.
  • Figure 3: Relative log amplitudes.
  • Figure 4: Several qualitative results.