PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis
Anwesa Choudhuri, Zhongpai Gao, Meng Zheng, Benjamin Planche, Terrence Chen, Ziyan Wu
TL;DR
PolypSegTrack integrates polyp detection, segmentation, classification, and tracking in colonoscopy videos within a single foundation-model framework. It introduces a conditional mask loss that enables learning from both segmentation masks and bounding boxes, and a non-heuristic, unsupervised tracking mechanism that operates in the space of object queries. The backbone is pre-trained on natural images with self-supervised methods (e.g., DINOv2), removing the need for colonoscopy-specific pre-training. Across multiple benchmarks, the model achieves state-of-the-art performance in detection, segmentation, classification, and tracking, demonstrating improved generalization and potential for clinical deployment.
Abstract
Early detection, accurate segmentation, classification, and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos. Our approach uses a novel conditional mask loss that enables flexible training across datasets annotated with either pixel-level segmentation masks or bounding boxes, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We build on a robust vision foundation model backbone that is pre-trained without supervision on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
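
The idea of a loss that adapts to the available annotation type can be illustrated with a minimal sketch. This is not the paper's formulation: the abstract does not specify the loss terms, so the choice of an L1 box term plus a Dice mask term, and the names `dice_loss`, `l1_box_loss`, and `conditional_mask_loss`, are assumptions made purely for illustration. The sketch only shows the conditional structure: the box term is always applied, while the mask term is applied only when a pixel-level mask annotation exists for the sample.

```python
import numpy as np


def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Soft Dice loss between a predicted probability mask and a binary mask."""
    inter = (pred_mask * gt_mask).sum()
    return 1.0 - (2.0 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)


def l1_box_loss(pred_box, gt_box):
    """Mean absolute error between two boxes given as (x1, y1, x2, y2) in [0, 1]."""
    return float(np.abs(np.asarray(pred_box) - np.asarray(gt_box)).mean())


def conditional_mask_loss(pred_box, pred_mask, gt_box, gt_mask=None):
    """Hypothetical conditional loss (an assumption, not the paper's definition).

    The box term is always computed. The mask term is added only when the
    sample carries a pixel-level annotation (gt_mask is not None), so that
    box-only datasets and mask-annotated datasets can share one training loop.
    """
    loss = l1_box_loss(pred_box, gt_box)
    if gt_mask is not None:
        loss += float(dice_loss(pred_mask, gt_mask))
    return loss


# Toy example: a 4x4 predicted mask with a 2x2 foreground region.
pred_mask = np.zeros((4, 4))
pred_mask[1:3, 1:3] = 1.0
pred_box = [0.25, 0.25, 0.75, 0.75]
gt_box = [0.25, 0.25, 0.75, 0.75]

# Sample from a box-only dataset: only the box term contributes.
box_only = conditional_mask_loss(pred_box, pred_mask, gt_box)

# Sample from a mask-annotated dataset: the Dice term is added as well.
with_mask = conditional_mask_loss(pred_box, pred_mask, gt_box, gt_mask=pred_mask.copy())
```

In this toy setup both samples have a perfect box prediction, and the mask-annotated sample also has a perfect mask, so both losses are close to zero. Supplying a mismatched `gt_mask` raises the loss only for samples that carry a mask annotation.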
