Track-On2: Enhancing Online Point Tracking with Memory

Görkay Aydemir, Weidi Xie, Fatma Güney

Abstract

In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e., tracking points frame-by-frame, which makes the approach suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. In comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2
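
The "coarse patch-level classification followed by refinement" step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product scoring, and the `offset_fn` callback (a hypothetical stand-in for a learned sub-patch regression head) are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine(query, patch_feats, grid_w, stride, offset_fn):
    """Locate a point from one query vector and one frame's patch features.

    query       : feature vector of the tracked point (list of floats)
    patch_feats : per-patch feature vectors in row-major order
    grid_w      : number of patches per row
    stride      : patch size in pixels
    offset_fn   : hypothetical stand-in for a learned refinement head;
                  returns a (dx, dy) pixel offset from the patch centre
    """
    # Coarse step: treat localisation as classification over patches.
    probs = softmax([dot(query, feat) for feat in patch_feats])
    best = max(range(len(probs)), key=probs.__getitem__)
    row, col = divmod(best, grid_w)

    # Centre of the winning patch in pixel coordinates.
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride

    # Refinement step: add a sub-patch offset to the patch centre.
    dx, dy = offset_fn(query, patch_feats[best])
    return (cx + dx, cy + dy), probs[best]


# Toy usage: a 2x2 patch grid with 2-dimensional features.
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
point, confidence = coarse_to_fine(
    query=[0.0, 2.0],
    patch_feats=feats,
    grid_w=2,
    stride=4,
    offset_fn=lambda q, f: (1.0, -0.5),  # fixed offset, for illustration only
)
```

In this toy example the query matches the second patch best, so the coarse estimate is that patch's centre and the fixed offset shifts it within the patch. In a real model the offset would come from a trained head rather than a constant.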

Paper Structure

This paper contains 30 sections, 8 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Comparison with prior state-of-the-art. We report $\delta^{x}_{avg}$ on three representative benchmarks: short real videos (DAVIS), mid-length robotic sequences (RoboTAP), and very long synthetic videos (PointOdyssey). CoTracker3 [Karaev2024ARXIV] and BootsTAPNext [Zholus2025ICCV] are fine-tuned on real-world data, which boosts their performance. Despite relying only on synthetic training, Track-On2 achieves higher accuracy.
  • Figure 2: Offline vs. Online Point Tracking. We propose an online model, tracking points frame-by-frame (right), unlike the dominant offline paradigm where models require access to all frames within a sliding window or the entire video (left). To capture temporal information, we introduce a memory, storing contextual information from previous states of the point. The blue box denotes the initial query sampled from the first-frame point location, and the purple boxes represent the decoded queries at each timestep, conditioned on both the initial sample and the evolving memory.
  • Figure 3: Max GPU memory usage with increasing video length. We plot the maximum GPU memory usage of state-of-the-art models when tracking 256 points in videos of varying lengths. Models are grouped into three categories based on their input processing strategy: video, window, and frame. Video-level models scale poorly and quickly run out of memory on longer sequences, while window- and frame-level models remain efficient across all lengths.
  • Figure 4: Overview. We introduce Track-On2, a transformer-based method for online, frame-by-frame point tracking. The pipeline consists of three stages: (i) Visual Encoder (top-left), which extracts multi-scale features from each frame using a DINOv3-based ViT-Adapter and fuses them via an FPN; (ii) Query Decoding (bottom-left), where point queries attend to the current frame features and a persistent memory propagated from the previous frame; (iii) Point Prediction (right), which estimates correspondences in a coarse-to-fine manner. Decoded queries are refined by a re-ranking module that incorporates local information from candidate matches, and point locations are predicted by patch-level classification with sub-patch regression, alongside a lightweight visibility head. Refined queries are appended to memory for use in the next frame. $\Box$ denotes a point query, and denotes a prediction.
  • Figure 5: Feature Drift. For the tracks shown below (start, middle, and final frames), the plot above illustrates the decreasing similarity between the features of the initial query and its correspondences over time, with the initial similarity indicated by horizontal dashed lines.
  • ...and 11 more figures
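
The $\delta^{x}_{avg}$ metric reported in Figure 1 is not defined in this excerpt. The sketch below assumes the convention commonly used in the TAP-Vid benchmark: for visible ground-truth points, compute the fraction of predictions within 1, 2, 4, 8, and 16 pixels of the ground truth, then average the five fractions. The function name, argument layout, and threshold set are assumptions; the paper's evaluation protocol (for example, the resolution at which distances are measured) may differ.

```python
import math


def delta_avg(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average position accuracy over pixel thresholds (assumed TAP-Vid style).

    pred, gt : lists of (x, y) positions, one entry per point and frame
    visible  : list of bools, True where the ground-truth point is visible
    """
    # Only visible ground-truth points contribute to the metric.
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible ground-truth points to evaluate")

    # Fraction of visible points whose error is below each threshold.
    per_threshold = [
        sum(err < t for err in errors) / len(errors) for t in thresholds
    ]
    return sum(per_threshold) / len(thresholds)


# Toy usage: three visible points with errors of 0, 3, and 10 pixels,
# plus one occluded point that is ignored.
score = delta_avg(
    pred=[(0, 0), (3, 0), (10, 0), (100, 0)],
    gt=[(0, 0), (0, 0), (0, 0), (0, 0)],
    visible=[True, True, True, False],
)
```

Here the per-threshold fractions are 1/3, 1/3, 2/3, 2/3, and 1, so the score is 0.6. Averaging over several thresholds rewards both coarse and fine localisation rather than a single tolerance.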