
Track-On: Transformer-based Online Point Tracking with Memory

Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney

TL;DR

Track-On introduces an online, transformer-based framework for long-term point tracking that operates frame-by-frame without access to future frames. It uses two memory modules, spatial memory and context memory, to maintain temporal continuity and combat feature drift, and it predicts correspondences through patch classification followed by offset refinement. The approach achieves state-of-the-art performance among online trackers and superior or competitive results compared with offline methods on seven datasets, including the TAP-Vid benchmark. An inference-time memory extension enables longer temporal coverage without sacrificing real-time operation. Ablations validate the contributions of the memory modules, the patch ranking module, and the refinement components.

Abstract

In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: https://kuis-ai.github.io/track_on

Paper Structure

This paper contains 26 sections, 11 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Offline vs. Online Point Tracking. We propose an online model that tracks points frame-by-frame (right), unlike the dominant offline paradigm, in which models require access to all frames within a sliding window or the entire video (left). Our approach therefore supports tracking in videos of any length. To capture temporal information, we introduce two memory modules: spatial memory, which tracks changes in the target point, and context memory, which stores broader contextual information from previous states of the point.
  • Figure 2: Overview. We introduce Track-On, a simple transformer-based method for online, frame-by-frame point tracking. The process involves three steps: (i) Visual Encoder, which extracts features from the given frame; (ii) Query Decoder, which decodes interest point queries using the frame’s features; (iii) Point Prediction (highlighted in light blue), where correspondences are estimated in a coarse-to-fine manner: first by patch classification based on similarity, then by refinement through offset prediction from the few most likely patches. Squares denote point queries, while circles denote predictions, either point coordinates or visibility.
  • Figure 3: Top-$k$ Points. In certain cases, a patch with high similarity, though not the most similar, is closer to the ground-truth patch. The top-$3$ patch centers, ranked by similarity, are marked with dots, while the ground-truth is represented by a diamond.
  • Figure 4: Ranking Module. The features around the top-$k$ points ($\hat{\mathbf{p}}_t^{\text{top}}$) with the highest similarity are decoded using deformable attention to extract the corresponding top-$k$ features ($\mathbf{q}_t^{\text{top}}$). These features are then fused with the decoded query $\mathbf{q}_t^{\text{dec}}$ using a transformer decoder.
  • Figure 5: Offset Head. Starting with a rough estimation from patch classification (left), where lighter colors indicate higher correlation, we refine the prediction using the offset head (right). The selected patch center and the final prediction are marked by a blue dot and a red dot, respectively, with the ground-truth represented by a diamond.
  • ...and 10 more figures
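
The coarse-to-fine point prediction described in the captions of Figures 2, 3, and 5 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `coarse_to_fine`, the use of cosine similarity, the `stride` parameter, and the `offset_fn` placeholder are all assumptions made for illustration. In particular, the learned ranking module and offset head are replaced here by simply taking the most similar patch and applying a clipped offset from a user-supplied callable.

```python
import numpy as np

def coarse_to_fine(query, feats, stride, k=3, offset_fn=None):
    """Hypothetical sketch of patch classification followed by offset refinement.

    query:     (D,) feature vector of the tracked point (assumed given).
    feats:     (H, W, D) patch features of the current frame (assumed given).
    stride:    patch size in pixels (assumption: square, non-overlapping patches).
    offset_fn: placeholder for a learned offset head; returns a (2,) pixel offset.
    """
    H, W, D = feats.shape
    flat = feats.reshape(H * W, D)

    # Coarse step: score every patch by cosine similarity to the query.
    q = query / (np.linalg.norm(query) + 1e-8)
    f = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = f @ q                                   # shape (H*W,)

    # Keep the k most similar patches and convert indices to pixel centers (x, y).
    top = np.argsort(-sim)[:k]
    rows, cols = np.divmod(top, W)
    centers = np.stack([(cols + 0.5) * stride, (rows + 0.5) * stride], axis=1)

    # Fine step: refine the best patch center with an offset limited to one patch.
    if offset_fn is None:
        offset = np.zeros(2)
    else:
        offset = np.clip(offset_fn(query, flat[top[0]]), -stride, stride)
    return centers, centers[0] + offset

# Usage: plant the query in patch (row 2, col 3) of a random feature map.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5, 8))
query = feats[2, 3].copy()
centers, point = coarse_to_fine(query, feats, stride=4, k=3)
```

In this toy setting the best-matching patch is the one the query was copied from, so the coarse estimate is that patch's center. Keeping all top-$k$ centers rather than only the first reflects the observation in Figure 3 that a highly similar, but not the most similar, patch can lie closer to the ground truth.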