From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

Shivanshu Agnihotri, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha

TL;DR

This work tackles the challenge of accurate polyp segmentation under diverse and resource-constrained clinical settings. It introduces Polyp-DiFoM, a modular distillation framework that transfers rich representations from foundation models (e.g., SAM, DINOv2, OneFormer, Mask2Former) into lightweight baselines like U-Net, augmented by frequency-domain encoding to capture both semantic and structural details. The method achieves strong cross-dataset generalization and substantial efficiency gains, outperforming vanilla baselines and approaching or matching state-of-the-art methods with far fewer parameters and lower compute. Extensive experiments across five benchmarks demonstrate robust performance on seen and unseen data, with qualitative analyses highlighting sharper boundaries and better generalization, making Polyp-DiFoM well-suited for real-time clinical deployment.

Abstract

Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging because polyps vary significantly in size, shape, and color and are often camouflaged. While lightweight baseline models such as U-Net, U-Net++, and PraNet are easy to deploy and computationally cheap, they struggle with these issues, which limits their segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and the lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency-domain encoding for enhanced distillation, which improves their generalization capability. Extensive experiments are performed across five benchmark datasets, namely Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9 times lower computational overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.
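
The abstract does not spell out the distillation objective, so the following is only a generic illustration of the idea it describes: matching a lightweight student's feature maps to those of a foundation-model teacher in both the spatial and the frequency domain. The function names, the use of a mean squared error, the amplitude-spectrum comparison, and the `freq_weight` parameter are all assumptions made for this sketch and are not taken from the paper or its repository.

```python
import numpy as np


def spatial_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared error between feature maps of shape (C, H, W)."""
    return float(np.mean((student - teacher) ** 2))


def frequency_distill_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """MSE between the amplitude spectra of the two feature maps.

    The 2D FFT is taken over the spatial axes (H, W) of every channel.
    Comparing amplitudes only is an assumption of this sketch.
    """
    s_amp = np.abs(np.fft.fft2(student, axes=(-2, -1), norm="ortho"))
    t_amp = np.abs(np.fft.fft2(teacher, axes=(-2, -1), norm="ortho"))
    return float(np.mean((s_amp - t_amp) ** 2))


def distillation_loss(student: np.ndarray, teacher: np.ndarray,
                      freq_weight: float = 0.5) -> float:
    """Weighted sum of the spatial and frequency terms (hypothetical weighting)."""
    if student.shape != teacher.shape:
        raise ValueError("student and teacher features must share a shape")
    return (spatial_distill_loss(student, teacher)
            + freq_weight * frequency_distill_loss(student, teacher))


# Toy features standing in for a teacher (e.g., a frozen foundation model)
# and a student (e.g., a U-Net stage) after projection to a common shape.
rng = np.random.default_rng(0)
teacher_feat = rng.standard_normal((8, 16, 16))
student_feat = teacher_feat + 0.1 * rng.standard_normal((8, 16, 16))
loss = distillation_loss(student_feat, teacher_feat)
```

In this sketch the frequency term ignores phase, so it is insensitive to circular spatial shifts, while the spatial term is not; combining both is one simple way to constrain structural and positional agreement at the same time. In practice the teacher and student features would first have to be projected to a common channel count and resolution, a step omitted here.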

Paper Structure

This paper contains 22 sections, 10 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the proposed Polyp-DiFoM framework. Our modular distillation-based architecture transfers rich structural and semantic priors from foundation models (SAM, DINOv2, OneFormer, Mask2Former) into a lightweight segmentation baseline (U-Net).
  • Figure 2: Qualitative comparison of results across all five datasets, with an illustration of phase-wise learning progression.