Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision

Rulin Zhou, Wenlong He, An Wang, Qiqi Yao, Haijun Hu, Jiankun Wang, Xi Zhang and Hongliang Ren

TL;DR

This paper tackles robust tissue point tracking in endoscopic videos where deformation, occlusion, and artifacts challenge tracking and dense annotations are scarce. It introduces Endo-TTAP, integrating a Multi-Facet Guided Attention (MFGA) module that fuses multi-scale flow, semantic embeddings, and motion cues with a two-stage Auxiliary Curriculum Adapter (ACA) to smoothly adapt from synthetic to real data. A hybrid supervision scheme combines unsupervised optical-flow distillation and semi-supervised pseudo-label learning to reduce annotation dependence. Across SurgT, STIR, and the Endo-TAPC5 dataset, Endo-TTAP achieves state-of-the-art accuracy and robustness, especially under occlusion and long sequences, demonstrating potential for improved surgical navigation and scene understanding.

Abstract

Accurate tissue point tracking in endoscopic videos is critical for robotic-assisted surgical navigation and scene understanding, but remains challenging due to complex deformations, instrument occlusion, and the scarcity of dense trajectory annotations. Existing methods struggle with long-term tracking under these conditions due to limited feature utilization and annotation dependence. We present Endo-TTAP, a novel framework addressing these challenges through: (1) A Multi-Facet Guided Attention (MFGA) module that synergizes multi-scale flow dynamics, DINOv2 semantic embeddings, and explicit motion patterns to jointly predict point positions with uncertainty and occlusion awareness; (2) A two-stage curriculum learning strategy employing an Auxiliary Curriculum Adapter (ACA) for progressive initialization and hybrid supervision. Stage I utilizes synthetic data with optical flow ground truth for uncertainty-occlusion regularization, while Stage II combines unsupervised flow consistency and semi-supervised learning with refined pseudo-labels from off-the-shelf trackers. Extensive validation on two MICCAI Challenge datasets and our collected dataset demonstrates that Endo-TTAP achieves state-of-the-art performance in tissue point tracking, particularly in scenarios characterized by complex endoscopic conditions. The source code and dataset will be available at https://anonymous.4open.science/r/Endo-TTAP-36E5.
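The abstract does not say how the Stage II "flow consistency" signal is computed. One common way to obtain such a signal without labels is a forward-backward check: follow a point with the forward flow, return with the backward flow, and measure how far it lands from where it started. The sketch below illustrates that general idea only and is not the authors' implementation; the function names, the nearest-neighbour sampling, and the `tol` threshold are all assumptions made for illustration.

```python
import numpy as np


def sample_flow(flow, pts):
    """Look up a dense flow field (H, W, 2) at points (N, 2) given as (x, y).

    Nearest-neighbour sampling is an assumption made for brevity;
    bilinear sampling would be the usual choice in practice.
    """
    h, w = flow.shape[:2]
    x = np.clip(np.rint(pts[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(pts[:, 1]).astype(int), 0, h - 1)
    return flow[y, x]


def forward_backward_error(flow_fw, flow_bw, pts):
    """Distance between each point and its forward-then-backward round trip."""
    fwd = pts + sample_flow(flow_fw, pts)    # position in the next frame
    back = fwd + sample_flow(flow_bw, fwd)   # mapped back to the first frame
    return np.linalg.norm(back - pts, axis=1)


def consistency_mask(flow_fw, flow_bw, pts, tol=1.0):
    """True where the round-trip error is within `tol` pixels (hypothetical threshold)."""
    return forward_backward_error(flow_fw, flow_bw, pts) <= tol


if __name__ == "__main__":
    H, W = 8, 8
    fw = np.zeros((H, W, 2)); fw[..., 0] = 2.0    # every pixel moves 2 px right
    bw = np.zeros((H, W, 2)); bw[..., 0] = -2.0   # exact inverse motion
    pts = np.array([[1.0, 1.0], [3.0, 4.0]])
    print(forward_backward_error(fw, bw, pts))
    print(consistency_mask(fw, np.zeros((H, W, 2)), pts))
```

Points that fail such a check are often treated as occluded or unreliable, so a mask of this kind could be used to down-weight pseudo-labels or flow targets; whether Endo-TTAP does this is not stated in the abstract.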

Paper Structure

This paper contains 8 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our Endo-TTAP framework. The Auxiliary Curriculum Adapter (ACA) module facilitates progressive initialization and finetuning of the Uncertainty Head and the Occlusion Head. Hybrid datasets are utilized in the two-stage training of a robust tissue point tracking model with Multi-Facet Guided Attention (MFGA).
  • Figure 2: Tissue point tracking comparison of our method (red point) with GT (black point) and MFT [neoral2024mft] (blue point). Our method tracks more accurately on long videos (blue box) and under instrument occlusion (red box).
  • Figure 3: Qualitative comparison of the tracking trajectories of MFT [neoral2024mft], MFTIQ [serych2024mftiq] and our Endo-TTAP for (a) normal and (b) challenging cases. The rectangular boxes highlight regions where our approach tracks points more accurately and stably.