Polyp-DAM: Polyp segmentation via depth anything model

Zhuoran Zheng, Chen Wu, Wei Wang, Yeying Jin, Xiuyi Jia

TL;DR

Polyp-DAM tackles automated polyp segmentation by integrating depth priors from the Depth Anything Model (DAM) into a lightweight segmentation network. It builds RGB-depth inputs at four scales and feeds them to a novel multi-scale MixNet (M$^{2}$ixNet), which combines global and local feature processing to produce accurate masks with a small parameter count. Across five public benchmarks, Polyp-DAM achieves state-of-the-art results and remains robust on noisy images, which highlights the value of depth priors as a lightweight alternative to fine-tuning large models. The approach offers a practical and scalable path for depth-guided segmentation in endoscopic imaging.

Abstract

Recently, large models such as the Segment Anything Model have provided a new baseline for polyp segmentation tasks. This demonstrates that large models with a sufficient image-level prior can achieve promising performance on a given task. In this paper, we offer a new perspective on polyp segmentation by leveraging the Depth Anything Model (DAM) to provide a depth prior to polyp segmentation models. Specifically, the input polyp image is first passed through a frozen DAM to generate a depth map. The depth map and the input polyp image are then concatenated and fed into a multi-scale convolutional neural network to generate segmentation masks. Extensive experimental results demonstrate the effectiveness of our method; in addition, we observe that it still performs well on noisy polyp images. Our code is available at https://github.com/zzr-idam/Polyp-DAM.
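
The input construction described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the concatenation is channel-wise (three RGB channels plus one depth channel), which the abstract does not state explicitly. The function `placeholder_depth` is a hypothetical stand-in for the frozen DAM so that the example runs without model weights; it is not a depth estimator and not part of the authors' code.

```python
import numpy as np


def placeholder_depth(image):
    """Hypothetical stand-in for the frozen DAM.

    Returns an (H, W) map scaled to [0, 1]. It only normalizes the mean
    channel intensity so the example is runnable; a real pipeline would
    call the frozen depth model here instead.
    """
    gray = image.astype(np.float64).mean(axis=2)
    span = gray.max() - gray.min()
    if span == 0:
        return np.zeros_like(gray)
    return (gray - gray.min()) / span


def make_rgbd(image, depth_fn=placeholder_depth):
    """Stack an (H, W, 3) image in [0, 255] with its depth map into (H, W, 4).

    Assumption: the depth map is appended as a fourth channel.
    """
    rgb = image.astype(np.float64) / 255.0   # scale RGB to [0, 1]
    depth = depth_fn(image)                  # frozen model: used for inference only
    return np.concatenate([rgb, depth[..., None]], axis=2)


if __name__ == "__main__":
    image = np.random.default_rng(0).integers(0, 256, size=(352, 352, 3))
    rgbd = make_rgbd(image)
    print(rgbd.shape)
```

Passing the depth estimator in as `depth_fn` keeps the sketch independent of any particular depth-model API; swapping in a real frozen model only requires a callable that maps an image to an (H, W) array.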

Paper Structure (12 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: This figure shows the result of processing a polyp image with DAM, which separates the polyp (foreground) from the rest of the image (background) very well.
  • Figure 2: This figure shows the processing pipeline of our global module. Here, the feature map's dimensions are transformed and then filtered with a $1 \times 1$ convolution, which is similar to performing an attention operation along that dimension.
  • Figure 3: The structure of our method. First, we obtain the depth map of the input polyp image via DAM. Next, the input image and depth map are bilinearly downsampled to $256 \times 256$, $128 \times 128$, and $64 \times 64$; these three scales, together with the original resolution, are fed into M$^{2}$ixNet. Finally, our network outputs four masks of different sizes, which are supervised by the GT resized to the same four sizes. Note that the benchmark evaluation uses the mask that the model outputs at the original resolution.
  • Figure 4: Our method exhibits the best visual results.
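
The four-scale scheme in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the half-pixel bilinear convention, the use of binary cross-entropy, the equal weighting of the four scales, and bilinear resizing of the GT are all assumptions, and the helper names `bilinear_resize`, `build_pyramid`, and `multiscale_loss` are hypothetical.

```python
import numpy as np


def bilinear_resize(img, out_h, out_w):
    """Resize an (H, W, C) array with bilinear interpolation (half-pixel centers)."""
    img = np.asarray(img, dtype=np.float64)
    in_h, in_w = img.shape[:2]
    # Map each output pixel center back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def build_pyramid(x, sizes=(256, 128, 64)):
    """Return [original, 256x256, 128x128, 64x64] versions of an (H, W, C) array."""
    original = np.asarray(x, dtype=np.float64)
    return [original] + [bilinear_resize(original, s, s) for s in sizes]


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy (an assumed loss; the caption names none)."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))


def multiscale_loss(pred_masks, gt_mask):
    """Sum the loss over all predicted masks, resizing the GT to each mask's size."""
    total = 0.0
    for pred in pred_masks:
        h, w = pred.shape[:2]
        total += bce(pred, bilinear_resize(gt_mask, h, w))
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgbd = rng.random((352, 352, 4))                      # stand-in RGB-D input
    gt = (rng.random((352, 352, 1)) > 0.5).astype(float)  # stand-in GT mask
    inputs = build_pyramid(rgbd)
    # Stand-in predictions: one (h, w, 1) mask per input scale.
    preds = [np.full(x.shape[:2] + (1,), 0.5) for x in inputs]
    print([x.shape for x in inputs], multiscale_loss(preds, gt))
```

At evaluation time only the first entry (the original-resolution mask) would be used, as the caption notes; the three smaller masks serve as auxiliary supervision during training.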