Tracking Everything in Robotic-Assisted Surgery

Bohan Zhan, Wang Zhao, Yi Fang, Bo Du, Francisco Vasconcelos, Danail Stoyanov, Daniel S. Elson, Baoru Huang

TL;DR

The paper tackles accurate tracking of tissues and instruments in RAMIS videos, where traditional sparse keypoint methods and dense optical flow struggle under deformation, occlusion, and rapid instrument motion. It introduces a new annotated surgical tracking dataset with 20 real-world videos and 25 points per frame for benchmarking tracking methods, and proposes SurgMotion, a TAP-based tracker enhanced with a tool mask constraint, as-rigid-as-possible (ARAP) consistency, and a LoFTR-guided long-term loss. Experiments show that SurgMotion outperforms most state-of-the-art TAP-based methods on surgical instrument tracking, especially in challenging scenarios, while maintaining strong tissue tracking; ablations verify the contribution of each loss term. The work contributes both an evaluation resource and a tracking method for RAMIS, with code and data publicly available.

Abstract

Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery (RAMIS), as it enables the robot to comprehend the surgical scene through the precise locations and interactions of tissues and tools. Traditional keypoint-based sparse tracking is limited to distinctive feature points, while flow-based dense two-view matching suffers from long-term drift. Recently, the Tracking Any Point (TAP) algorithm was proposed to overcome these limitations and achieve dense, accurate, long-term tracking. However, its efficacy in surgical scenarios remains untested, largely due to the lack of a comprehensive surgical tracking dataset for evaluation. To address this gap, we introduce a new annotated surgical tracking dataset for benchmarking tracking methods in surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions. We extensively evaluate state-of-the-art (SOTA) TAP-based algorithms on this dataset and reveal their limitations in challenging surgical scenarios, including fast instrument motion, severe occlusions, and motion blur. Furthermore, we propose a new tracking method, SurgMotion, to address these challenges and further improve tracking performance. Our proposed method outperforms most TAP-based algorithms in surgical instrument tracking and demonstrates especially significant improvements over baselines in challenging medical videos. Our code and dataset are available at https://github.com/zhanbh1019/SurgicalMotion.

Paper Structure (20 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Demonstration of our method tracking every point across an entire surgical video. (a) and (b) present results from different videos, with instrument tracking displayed from left to right over time.
  • Figure 2: Examples of the dataset, where tissues and surgical tools are annotated separately. The red cross indicates that this point is occluded in this frame.
  • Figure 3: Method overview. First, 2D points $p_i$ are lifted to 3D points $x_i$. Then, using a bijective transformation $T_i$, the $x_i$ in the local frame are mapped to a canonical 3D volume as $u$, and subsequently mapped to another local frame through an inverse bijection. A coordinate-based network $F_\theta$ is employed to compute the corresponding color $c$ and density $\sigma$ of point $u$ in the canonical volume, with the 2D positions obtained through alpha compositing. To ensure that points on the tools remain correctly mapped to the tool area, we introduce a tool mask and ARAP constraints. Additionally, since OmniMotion (Wang et al., 2023) is supervised by optical flow (OF), which becomes inaccurate in distant frames, we incorporate LoFTR (Sun et al., 2021) feature matching to enhance long-term tracking capabilities.
  • Figure 4: Qualitative comparison of our method with other baselines on our dataset. The leftmost column shows the initial query points. The three columns on the right display the tracking results over time. Occluded points are marked with a cross “+” and their estimated positions are shown. Notably, the white dashed boxes highlight CoTracker's incorrect occlusion predictions, whereas our method produces accurate results in these cases.
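
The alpha-compositing step named in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `composite_position` and `project_orthographic`, the opacity formula `1 - exp(-sigma)`, and the orthographic projection are all assumptions chosen to keep the example self-contained. It assumes the samples along a query ray have already been mapped through the canonical volume into the target frame, and shows only how per-sample densities turn those mapped 3D positions into one 2D prediction.

```python
import numpy as np


def composite_position(sigmas, points_3d):
    """Alpha-composite mapped 3D sample positions along one ray.

    sigmas:    (K,) non-negative densities at the K ray samples.
    points_3d: (K, 3) positions of those samples in the target frame
               (hypothetical input; the mapping itself is not shown).
    Returns the composited 3D point and the per-sample weights.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    points_3d = np.asarray(points_3d, dtype=float)

    # Assumed opacity model: denser samples are more opaque.
    alphas = 1.0 - np.exp(-sigmas)

    # Transmittance: fraction of the ray that reaches sample k
    # without being absorbed by samples 0..k-1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))

    weights = transmittance * alphas
    return weights @ points_3d, weights


def project_orthographic(point_3d):
    """Assumed projection: drop the depth coordinate to get a 2D position."""
    return point_3d[:2]


if __name__ == "__main__":
    # Three samples: empty space, an opaque surface, and a point behind it.
    sigmas = [0.0, 50.0, 1.0]
    mapped = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [9.0, 9.0, 9.0]]
    point, weights = composite_position(sigmas, mapped)
    print(project_orthographic(point), weights)
```

In this toy input the second sample is effectively opaque, so it receives almost all of the weight and the sample behind it contributes nothing, which is the behavior that lets the composited position follow the visible surface.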