When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos

Woowon Jang, Jiwon Im, Juseung Choi, Niki Rashidian, Wesley De Neve, Utku Ozbulak

TL;DR

The paper investigates the reliability of point-based tracking using SAM2 in surgical videos, with a focus on laparoscopic cholecystectomy. It systematically compares point-based initialization to segmentation-mask initialization across three targets (gallbladder, grasper, L-hook) using three point-placement strategies and multiple point counts on a CholecSeg8k subset. Results show that anatomy, particularly the gallbladder, suffers from boundary ambiguity and tissue similarity, while surgical tools are tracked more effectively with points; increasing the number of points helps but does not fully bridge the gap for anatomical targets. The study offers concrete recommendations for point placement and points toward future work with negative points to enhance robustness in complex surgical scenes.

Abstract

Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.

Paper Structure

This paper contains 8 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Qualitative examples illustrating two distinct tracking outcomes (a) and (b) for the gallbladder. Images in the first column show the ground truth segmentation masks; the second column shows the highest IoU case with 3 tracking points, and the third column shows the highest IoU case with 7 tracking points. Gallbladder regions correctly predicted are highlighted in pink, and incorrect predictions are highlighted in red. Tracking points are shown as white X marks with black outlines.
  • Figure 2: Qualitative examples illustrating three distinct failure modes in point-based tracking for the gallbladder. For each image pair, the left image shows the input with tracking points and the right image shows the segmentation output. (a) Failures due to tracking points placed near the object edges, causing the model to lose the target boundary. (b) Failures caused by tissue similarity, where surrounding structures confuse the model and lead to tracking drift. (c) Extraordinary cases that require case-specific investigation, such as partial object visibility or ambiguous visual cues. Incorrect predictions are highlighted in red and correctly predicted regions are highlighted using different colors for three objects (pink, blue, and cyan).