See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors
Kunyi Yang, Qingyu Wang, Cheng Yuan, Yutong Ban
TL;DR
The paper tackles the annotation bottleneck in surgical scene segmentation by introducing DepSeg, a training-free framework that uses monocular depth priors to guide SAM2 mask generation and a frozen DINOv3 encoder to label the resulting masks against a pre-built template bank. Class-descriptor templates are built offline from limited annotations, and each mask is assigned a class through cosine similarity with top-k aggregation, all without fine-tuning. On CholecSeg8k, DepSeg outperforms a direct SAM2 automatic segmentation baseline (35.9% vs. 14.7% mIoU) and remains competitive when only 10–20% of the object templates are used. New classes can be supported by adding templates, and the paper points toward future work on temporal consistency and rare-class coverage.
Abstract
Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10–20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
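The template-matching step can be illustrated with a minimal sketch. It assumes each mask has already been reduced to a single pooled feature vector and that the template bank is an array of feature vectors with integer class labels. The function name `classify_mask`, the parameter `k`, and the choice to average the top-k similarities per class are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_mask(mask_feat, bank_feats, bank_labels, num_classes, k=5):
    """Assign a class to one mask feature by cosine similarity to templates.

    Assumption: per-class score = mean of the k highest cosine similarities
    among that class's templates. The paper may aggregate differently.
    """
    query = l2_normalize(np.asarray(mask_feat, dtype=np.float64))
    bank = l2_normalize(np.asarray(bank_feats, dtype=np.float64))
    labels = np.asarray(bank_labels)

    # Dot product of unit vectors equals cosine similarity.
    sims = bank @ query

    # Classes with no templates keep a score of -inf and are never chosen.
    scores = np.full(num_classes, -np.inf)
    for c in range(num_classes):
        class_sims = sims[labels == c]
        if class_sims.size == 0:
            continue
        top_k = np.sort(class_sims)[-k:]  # uses all templates if fewer than k
        scores[c] = top_k.mean()

    return int(np.argmax(scores)), scores


# Toy template bank: two templates per class in a 2-D feature space.
toy_bank = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = np.array([0, 0, 1, 1])

pred_a, scores_a = classify_mask([0.8, 0.2], toy_bank, toy_labels, num_classes=3, k=2)
pred_b, _ = classify_mask([0.1, 1.0], toy_bank, toy_labels, num_classes=3, k=2)
```

Under this scheme, supporting a new class only requires appending its feature vectors and labels to the bank, and no weights are updated.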
