PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

Anwesa Choudhuri, Zhongpai Gao, Meng Zheng, Benjamin Planche, Terrence Chen, Ziyan Wu

TL;DR

PolypSegTrack integrates polyp detection, segmentation, classification, and tracking in colonoscopy videos within a unified foundation-model framework. It introduces a conditional mask loss that enables learning from both segmentation masks and bounding boxes, and a non-heuristic, unsupervised tracking mechanism operating in the space of object queries. The backbone is pre-trained on natural images with self-supervised methods (e.g., DINOv2), reducing the need for colonoscopy-specific pre-training. Across multiple benchmarks, the model achieves state-of-the-art performance in detection, segmentation, classification, and tracking, demonstrating improved generalization and potential for clinical deployment.

Abstract

Early detection, accurate segmentation, classification, and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained on natural images without supervision, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
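The abstract does not spell out the conditional mask loss, so the following is only a minimal sketch of one plausible reading: the box term is always applied, and the mask term is added only when a sample carries a pixel-level mask. The function names (`l1_box_loss`, `dice_loss`, `conditional_loss`), the choice of L1 and Dice terms, and the `mask_weight` parameter are all assumptions made for illustration, not the paper's formulation.

```python
from typing import Optional, Sequence


def l1_box_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def dice_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-6) -> float:
    """Dice loss on flattened masks: pred holds probabilities, target holds 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def conditional_loss(
    pred_box: Sequence[float],
    pred_mask: Sequence[float],
    gt_box: Sequence[float],
    gt_mask: Optional[Sequence[float]],
    mask_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Box loss always; mask loss only if this sample has a mask annotation."""
    loss = l1_box_loss(pred_box, gt_box)
    if gt_mask is not None:
        # Assumed behaviour: box-only samples simply skip the mask term.
        loss += mask_weight * dice_loss(pred_mask, gt_mask)
    return loss


# A box-only sample contributes just the box term.
box_only = conditional_loss((0, 0, 1, 1), [0.9, 0.1], (0, 0, 1, 2), None)

# A fully annotated sample contributes both terms.
with_mask = conditional_loss((0, 0, 1, 1), [1.0, 0.0], (0, 0, 1, 1), [1, 0])
```

Under this reading, datasets with only bounding boxes and datasets with masks can share one training loop, because the annotation type decides per sample which terms are active.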

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the proposed approach. Our proposed conditional mask loss (Sec. \ref{sec:app:maskloss}) allows flexible training. Our unsupervised and non-heuristic tracking on object queries (Sec. \ref{sec:app:tracking}) allows effective association of polyps across video frames.
  • Figure 2: Detection, classification, and tracking on 2 videos (top and bottom row) of the KUMC dataset, along with comprehensive reports generated for each video. The reports summarize the polyp IDs, the polyp type (AD: cancerous or HP: benign) with prediction confidences, the frame count (Fr. Ct.) of each polyp, its frame of first appearance (1st. Fr.), and its frame of last appearance (Last Fr.).
  • Figure 3: Detection and segmentation results on a few images from the ETIS dataset. Our model is able to discover small, hard-to-see polyps.
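
The per-video report described for Figure 2 (polyp ID, type with confidence, frame count, first and last frame) can be illustrated with a small aggregation over per-frame tracked detections. This is a hedged sketch, not the authors' code: the `Detection` record, the `build_report` function, and the rule for picking a track's type (the label with the highest summed confidence) are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Detection:
    """One tracked detection in one frame (hypothetical record layout)."""
    frame: int
    track_id: int
    label: str         # "AD" or "HP", as in the Figure 2 caption
    confidence: float


def build_report(detections: List[Detection]) -> List[Dict[str, object]]:
    """Summarize each track: type, confidence, frame count, first and last frame."""
    by_track: Dict[int, List[Detection]] = defaultdict(list)
    for det in detections:
        by_track[det.track_id].append(det)

    report = []
    for track_id in sorted(by_track):
        dets = by_track[track_id]
        # Assumed aggregation rule: the label with the largest summed confidence wins.
        scores: Dict[str, float] = defaultdict(float)
        for det in dets:
            scores[det.label] += det.confidence
        label = max(scores, key=lambda name: scores[name])
        votes = sum(1 for det in dets if det.label == label)
        frames = {det.frame for det in dets}
        report.append({
            "id": track_id,
            "type": label,
            "confidence": scores[label] / votes,  # mean confidence of the winning label
            "frame_count": len(frames),
            "first_frame": min(frames),
            "last_frame": max(frames),
        })
    return report


detections = [
    Detection(frame=0, track_id=1, label="AD", confidence=0.9),
    Detection(frame=1, track_id=1, label="AD", confidence=0.7),
    Detection(frame=2, track_id=1, label="HP", confidence=0.6),
    Detection(frame=5, track_id=2, label="HP", confidence=0.8),
]
report = build_report(detections)
```

Summing confidences rather than counting votes lets a few high-confidence frames outweigh many uncertain ones; a different aggregation rule would work equally well with the same report layout.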