One Patient's Annotation is Another One's Initialization: Towards Zero-Shot Surgical Video Segmentation with Cross-Patient Initialization

Seyed Amir Mousavi, Utku Ozbulak, Francesca Tozzi, Nikdokht Rashidian, Wouter Willaert, Joris Vankerschaver, Wesley De Neve

TL;DR

This work tackles the practical challenge of initializing video object segmentation in real-time surgery by transferring annotated frames from other patients as donor references. Using the zero-shot SAM2.1 Hiera Large model, the authors evaluate cross-patient donor frames on the CholecSeg8k dataset, comparing donor-based initialization against same-patient baselines. They show that donor frames can match or even exceed baseline performance in some cases, but results vary substantially with object type and frame choice, highlighting the need for effective donor-frame selection. The study suggests that reducing manual intervention in surgical VOS is feasible and potentially transformative for autonomous AI-assisted workflows, while also outlining avenues for automated donor-frame selection and improved robustness.

Abstract

Video object segmentation is an emerging technology that is well-suited for real-time surgical video segmentation, offering valuable clinical assistance in the operating room by ensuring consistent frame tracking. However, its adoption is limited by the need for manual intervention to select the tracked object, making it impractical in surgical settings. In this work, we tackle this challenge with an innovative solution: using previously annotated frames from other patients as the tracking frames. We find that this unconventional approach can match or even surpass the performance of using patients' own tracking frames, enabling more autonomous and efficient AI-assisted surgical workflows. Furthermore, we analyze the benefits and limitations of this approach, highlighting its potential to enhance segmentation accuracy while reducing the need for manual input. Our findings provide insights into key factors influencing performance, offering a foundation for future research on optimizing cross-patient frame selection for real-time surgical video analysis.

Paper Structure

This paper contains 9 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Example images from the CholecSeg8k dataset and their annotations for the gallbladder (purple), liver (light blue), and grasper (green).
  • Figure 2: The impact of donor frame selection on the segmentation performance obtained for a recipient video, measured by average IoU, is illustrated using boxplots for (a) Gallbladder, (b) Liver, and (c) Grasper.
  • Figure 3: (a) Tracking masks obtained from donor patients and (b) segmentation predictions obtained from SAM2 using tracking masks for recipient patients based on the tracked objects. The first row corresponds to the gallbladder, the second row to the liver, and the third row to the surgical grasper. Incorrect predictions (false negatives and false positives) are highlighted in red. Tracked objects in recipient patients obtain an average IoU score $\mathbf{>0.9}$ for their respective videos, demonstrating SAM2's ability to transfer tracking information across different patients.
  • Figure 4: (a) Tracking masks obtained from donor patients and (b) segmentation predictions obtained from SAM2 using tracking masks for recipient patients based on the tracked objects. The first row corresponds to the gallbladder, the second row to the liver, and the third row to the surgical grasper. Incorrect predictions (false negatives and false positives) are highlighted in red. Tracked objects in recipient patients obtain an average IoU score $\mathbf{<0.5}$ for their respective videos, demonstrating limitations of the approach under certain circumstances.
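
The figures above report segmentation quality as the average IoU over a recipient video. The authors' evaluation code is not shown here, so the following is only a minimal sketch of how such a score could be computed, assuming binary masks stored as nested lists of 0/1 values and a plain mean over frames. The names `iou` and `average_iou`, and the convention that two empty masks score 1.0, are illustrative assumptions rather than details taken from the paper.

```python
from typing import Sequence

# A binary mask: rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of identical shape."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumption: when the object is absent from both masks, count it as a match.
    if union == 0:
        return 1.0
    return intersection / union


def average_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean per-frame IoU over one video (one predicted mask per annotated frame)."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("need the same, non-zero number of predicted and target masks")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    # Two toy 2x2 frames: the first overlaps partially, the second matches exactly.
    predicted = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    annotated = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]
    print(average_iou(predicted, annotated))
```

Under this sketch, a recipient video from Figure 3 would be one whose `average_iou` exceeds 0.9 for the tracked object, and one from Figure 4 would fall below 0.5. Repeating the computation for each candidate donor frame yields the distribution of scores that the boxplots in Figure 2 summarize.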