One Patient's Annotation is Another One's Initialization: Towards Zero-Shot Surgical Video Segmentation with Cross-Patient Initialization
Seyed Amir Mousavi, Utku Ozbulak, Francesca Tozzi, Nikdokht Rashidian, Wouter Willaert, Joris Vankerschaver, Wesley De Neve
TL;DR
This work tackles the practical challenge of initializing video object segmentation (VOS) for real-time surgical video by using annotated frames from other patients as donor references. Using the zero-shot SAM2.1 Hiera Large model, the authors evaluate cross-patient donor frames on the CholecSeg8k dataset, comparing donor-based initialization against same-patient baselines. They show that donor frames can match or even exceed baseline performance in some cases, but results vary substantially with object type and frame choice, which highlights the need for effective donor-frame selection. The study suggests that reducing manual intervention in surgical VOS is feasible and could enable more autonomous AI-assisted workflows, and it outlines avenues for automated donor-frame selection and improved robustness.
Abstract
Video object segmentation is an emerging technology that is well suited to real-time surgical video segmentation, offering valuable clinical assistance in the operating room by tracking objects consistently across frames. However, its adoption is limited by the need for manual intervention to select the object to track, which is impractical in surgical settings. In this work, we tackle this challenge by using previously annotated frames from other patients as the tracking frames. We find that this unconventional approach can match or even surpass the performance of using patients' own tracking frames, enabling more autonomous and efficient AI-assisted surgical workflows. Furthermore, we analyze the benefits and limitations of this approach, highlighting its potential to enhance segmentation accuracy while reducing the need for manual input. Our findings provide insights into key factors influencing performance, offering a foundation for future research on optimizing cross-patient frame selection for real-time surgical video analysis.
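The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The `propagate` callable is a hypothetical stand-in for any VOS model that takes an annotated frame plus a video and returns one mask per frame; it does not mirror the SAM2.1 API. Mean intersection-over-union is assumed as the score, since this section does not name the metric used.

```python
import numpy as np

def iou(pred, target):
    """Intersection-over-union of two binary masks (assumed metric)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)

def score_initialization(propagate, init_frame, init_mask, video, gt_masks):
    """Mean IoU over a video when tracking starts from (init_frame, init_mask).

    `propagate` is hypothetical: (init_frame, init_mask, video) -> list of masks.
    """
    preds = propagate(init_frame, init_mask, video)
    return float(np.mean([iou(p, g) for p, g in zip(preds, gt_masks)]))

def compare_initializations(propagate, own, donor, video, gt_masks):
    """Score the same video with the patient's own frame and with a donor frame.

    `own` and `donor` are (frame, mask) pairs; the donor pair comes from
    a different patient.
    """
    return {
        "own": score_initialization(propagate, own[0], own[1], video, gt_masks),
        "donor": score_initialization(propagate, donor[0], donor[1], video, gt_masks),
    }
```

Passing the tracker in as a callable keeps the comparison independent of any particular model, so the same loop could be repeated over many candidate donor frames to study how frame choice affects the score.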
