LiteTracker: Leveraging Temporal Causality for Accurate Low-latency Tissue Tracking

Mert Asim Karaoglu, Wenbo Ji, Ahmed Abbas, Nassir Navab, Benjamin Busam, Alexander Ladikos

TL;DR

LiteTracker tackles real-time tissue tracking in endoscopy by delivering a frame-by-frame, low-latency variant of long-term point tracking. It extends CoTracker3 with a training-free temporal memory buffer and Exponential Moving Average (EMA) flow initialization to enable efficient online tracking. It runs roughly 7× faster than its predecessor, CoTracker3, and about 2× faster than the fastest evaluated baseline, Track-On, while maintaining competitive accuracy on the STIR and SuPer datasets. Key ideas include caching expensive correlation features, masking proxies in attention, and initializing new frame locations via $F_t=\alpha (P_{t-1}-P_{t-2})+(1-\alpha)F_{t-1}$ with $\alpha=0.8$, where $P_t$ denotes the tracked point locations at frame $t$ and $F_t$ the EMA flow. This initialization enables a single pass through the iterative refinement module ($L=1$). These results make LiteTracker a practical candidate for real-time surgical navigation and XR, and the code is released for reproducibility.
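The EMA flow update can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `ema_flow_step`, the tuple-based point representation, and the choice to place the initial estimate for the new frame at $P_{t-1} + F_t$ are assumptions made for the example.

```python
from typing import List, Tuple

Vec2 = Tuple[float, float]


def ema_flow_step(
    prev_pts: List[Vec2],       # P_{t-1}: point locations at the previous frame
    prev_prev_pts: List[Vec2],  # P_{t-2}: point locations two frames back
    prev_flow: List[Vec2],      # F_{t-1}: previous EMA flow per point
    alpha: float = 0.8,         # smoothing factor quoted in the summary
) -> Tuple[List[Vec2], List[Vec2]]:
    """Return (F_t, initial locations for frame t) for every tracked point."""
    new_flow: List[Vec2] = []
    init_pts: List[Vec2] = []
    for (x1, y1), (x2, y2), (fx, fy) in zip(prev_pts, prev_prev_pts, prev_flow):
        # F_t = alpha * (P_{t-1} - P_{t-2}) + (1 - alpha) * F_{t-1}
        nfx = alpha * (x1 - x2) + (1.0 - alpha) * fx
        nfy = alpha * (y1 - y2) + (1.0 - alpha) * fy
        new_flow.append((nfx, nfy))
        # Assumption: the new frame is initialized by extrapolating with F_t.
        init_pts.append((x1 + nfx, y1 + nfy))
    return new_flow, init_pts


if __name__ == "__main__":
    flow, init = ema_flow_step([(10.0, 5.0)], [(8.0, 5.0)], [(1.0, 0.0)])
    print(flow, init)
```

With $\alpha=0.8$ the most recent displacement dominates, while the remaining 20% of the weight smooths out frame-to-frame jitter in the earlier motion estimate.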

Abstract

Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performance fails to meet the low-latency requirements of real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements, running around 7× faster than its predecessor and 2× faster than the state of the art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room. Our code is publicly available at https://github.com/ImFusionGmbH/lite-tracker.
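The feature-reuse idea behind the temporal memory buffer can be illustrated with a small sketch. Everything named here is hypothetical: `TemporalMemoryBuffer`, `track_stream`, the `compute_corr` callable, the window size, and the first-in-first-out eviction policy are assumptions chosen for illustration, not details taken from the released code.

```python
from collections import deque
from typing import Any, Callable, Deque, Iterable, Iterator, List


class TemporalMemoryBuffer:
    """Fixed-size FIFO cache of per-frame features (illustrative sketch)."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._items: Deque[Any] = deque(maxlen=window)

    def push(self, features: Any) -> None:
        self._items.append(features)

    def window(self) -> List[Any]:
        # Oldest to newest, ready to be consumed by a temporal model.
        return list(self._items)

    def __len__(self) -> int:
        return len(self._items)


def track_stream(
    frames: Iterable[Any],
    compute_corr: Callable[[Any], Any],  # hypothetical expensive feature step
    window: int = 4,
) -> Iterator[List[Any]]:
    """Yield the cached feature window after each incoming frame."""
    buffer = TemporalMemoryBuffer(window)
    for frame in frames:
        # Only the new frame is processed; older features come from the cache.
        buffer.push(compute_corr(frame))
        yield buffer.window()


if __name__ == "__main__":
    for cached in track_stream(range(6), lambda f: f * 10, window=3):
        print(cached)
```

In this sketch the expensive step runs exactly once per frame, regardless of the window length, which is the property that makes frame-by-frame processing cheap compared with recomputing features for the whole window.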

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: (Top) Demonstration of LiteTracker on a video from the STIR Challenge 2024 dataset [schmidt2025point]. (Bottom) Latency and tracking accuracy comparison. The Average Jaccard (AJ) metric is computed on the SuPer dataset [li2020super]. Latencies are the 95th percentile of inference step measurements for 1,024 points initialized at the first frame of a video with 615 frames. LiteTracker is approximately $2\times$ faster than the fastest evaluated method, Track-On [aydemir2025track], and $7\times$ faster than its predecessor, CoTracker3 [karaev2024cotracker3], while exhibiting close to state-of-the-art tracking accuracy.
  • Figure 2: LiteTracker's architecture. Given a video stream and a set of query points, we extract feature maps for each new frame and compute correlation features between the queries and points initialized with exponential moving average flow (EMA flow). We store these correlation features in a temporal memory buffer for efficient reuse on subsequent frames. Using a transformer, we propagate spatio-temporal information via attention to yield new point locations, visibility, and confidence scores.
  • Figure 3: Qualitative results on video samples from the STIR Challenge 2024 [schmidt2025point] (top) and StereoMIS [hayoz2023learning] (bottom) datasets. LiteTracker shows high tissue-tracking accuracy and occlusion handling under challenging deformations, tool interactions, and perspective changes.
  • Figure 4: Ablation studies. (Left) Exponential moving average flow (EMA flow) initialization improves tracking convergence, leading to the highest average tracking accuracy on the STIR Challenge 2024 dataset [schmidt2025point] with a single pass through the iterative refinement module. (Right) The temporal memory buffer improves runtime efficiency by approximately $2.7\times$ and provides low-latency inference during frame-by-frame tracking.
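
The Figure 1 caption reports latency as the 95th percentile of per-step inference measurements. A rough sketch of how such a number could be collected is shown below. The helper names `measure_step_latencies_ms` and `percentile_nearest_rank` are hypothetical, and the nearest-rank percentile definition is an assumption, since the summary does not state which percentile method was used.

```python
import time
from typing import Any, Callable, Iterable, List, Sequence


def measure_step_latencies_ms(
    step_fn: Callable[[Any], Any],  # hypothetical per-frame inference step
    inputs: Iterable[Any],
) -> List[float]:
    """Time each call to step_fn and return the latencies in milliseconds."""
    latencies: List[float] = []
    for item in inputs:
        start = time.perf_counter()
        step_fn(item)
        # Note: asynchronous GPU work would need an explicit synchronization
        # before reading the clock; this sketch only times synchronous calls.
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile_nearest_rank(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in [1, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(values)
    # Ceiling of q * n / 100 using integer arithmetic to avoid float rounding.
    rank = -(-q * len(ordered) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    lat = measure_step_latencies_ms(lambda frame: sum(range(1000)), range(50))
    print(f"p95 latency: {percentile_nearest_rank(lat, 95):.3f} ms")
```

Reporting a high percentile rather than the mean emphasizes worst-case behavior, which matters more than average speed when every frame must be processed within a fixed time budget.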