AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models

Uxue Delaquintana-Aramendi, Leire Benito-del-Valle, Aitor Alvarez-Gila, Javier Pascau, Luisa F Sánchez-Peralta, Artzai Picón, J Blas Pagador, Cristina L Saratxaga

TL;DR

This work assesses the effectiveness of foundation models for polyp detection and segmentation in colonoscopy by benchmarking five foundation models against traditional baselines across three datasets. A two-stage detection-then-segmentation pipeline is evaluated, and the results show that domain specialization and fine-tuning are essential for medical imaging tasks: combining GroundingDINO (GDINO) detection with MedSAM segmentation delivers the strongest performance, in some cases even in zero-shot settings. The study highlights the importance of domain-specific models (MedSAM) and adaptive detectors (GDINO) for robust performance across diverse imaging conditions and polyp morphologies, suggesting a viable path toward clinically useful AI-assisted colonoscopy. Overall, the findings indicate that a domain-specialized, dual-model approach can achieve high accuracy and generalization, potentially reducing missed polyps and improving prognosis in colorectal cancer screening.

Abstract

In colonoscopy, 80% of missed polyps could be detected with the help of deep learning models. In the search for algorithms capable of addressing this challenge, foundation models emerge as promising candidates. Their zero-shot or few-shot learning capabilities facilitate generalization to new data or tasks without extensive fine-tuning, which is particularly advantageous in medical imaging, where large annotated datasets for traditional training are scarce. In this context, a comprehensive evaluation of foundation models for polyp segmentation was conducted, assessing both detection and delimitation. Three colonoscopy datasets were used to compare the performance of five foundation models (DINOv2, YOLO-World, GroundingDINO, SAM and MedSAM) against two benchmark networks (YOLOv8 and Mask R-CNN). Results show that the success of foundation models in polyp characterization is highly dependent on domain specialization. For optimal performance in medical applications, domain-specific models are essential, and generic models require fine-tuning to achieve effective results. Through this specialization, foundation models demonstrated superior performance compared to state-of-the-art detection and segmentation models, with some models even excelling in zero-shot evaluation and outperforming fine-tuned models on unseen data.

Paper Structure

This paper contains 20 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Algorithm workflow diagram. The process begins with an image input, accompanied by a text prompt (optional) specifying the object of interest. A detection model generates bounding boxes around detected objects, which are then passed, along with the original image, to a segmentation model. The latter produces the final polyp mask.
  • Figure 2: Sample-wise AP for flat polyp detection (Paris classification). Left to right: YOLOv8 FT, MR-CNN R50 FT, MR-CNN DINOv2 FT, YOLO-World FT, GDINO FT.
  • Figure 3: Performance on the analyzed images. The first column shows the original input image (from the PICCOLO, PolypSegm-ASH or SUN-SEG datasets), with the corresponding Paris classification for PICCOLO. The second, third and fourth columns present the detection/segmentation results from the fine-tuned Mask R-CNN (baseline), fine-tuned GroundingDINO + MedSAM (best-performing) and GroundingDINO + SAM (worst-performing), respectively. The final column provides the ground truth annotations.
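
The workflow described in the Figure 1 caption (image plus optional text prompt, then a detector producing bounding boxes, then a segmenter producing the polyp mask) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `detect_then_segment`, `toy_detector` and `toy_segmenter` are hypothetical names, the toy models are simple brightness thresholds standing in for networks such as GroundingDINO and MedSAM, and merging per-box masks by pixel-wise union is an assumption the source does not specify.

```python
from typing import Callable, List, Optional, Tuple

# Placeholder types for illustration only (assumptions, not from the paper).
Image = List[List[float]]          # grayscale image as rows of pixel values
Mask = List[List[bool]]            # True where a pixel belongs to a polyp
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max), max-exclusive

Detector = Callable[[Image, Optional[str]], List[Box]]
Segmenter = Callable[[Image, Box], Mask]


def detect_then_segment(image: Image, detector: Detector, segmenter: Segmenter,
                        text_prompt: Optional[str] = None) -> Mask:
    """Stage 1: boxes from the detector. Stage 2: one mask per box.

    Per-box masks are merged by pixel-wise union (an assumption).
    """
    height, width = len(image), len(image[0])
    final_mask = [[False] * width for _ in range(height)]
    for box in detector(image, text_prompt):
        # The segmenter receives the original image together with the box.
        box_mask = segmenter(image, box)
        for y in range(height):
            for x in range(width):
                final_mask[y][x] = final_mask[y][x] or box_mask[y][x]
    return final_mask


def toy_detector(image: Image, text_prompt: Optional[str] = None) -> List[Box]:
    """Hypothetical stand-in detector: one box around all bright pixels.

    The text prompt is accepted but ignored by this toy model.
    """
    bright = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value > 0.5]
    if not bright:
        return []
    xs = [x for x, _ in bright]
    ys = [y for _, y in bright]
    return [(min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def toy_segmenter(image: Image, box: Box) -> Mask:
    """Hypothetical stand-in segmenter: bright pixels inside the box."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 and value > 0.5
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


demo_image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
demo_mask = detect_then_segment(demo_image, toy_detector, toy_segmenter,
                                text_prompt="polyp")
```

Keeping the detector and segmenter as plain callables means either stage can be swapped independently, which mirrors how the study pairs different detectors with SAM or MedSAM.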