Track-On: Transformer-based Online Point Tracking with Memory
Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney
TL;DR
Track-On is an online, transformer-based framework for long-term point tracking that operates frame by frame without access to future frames. It uses two memory modules, spatial memory and context memory, to maintain temporal continuity and counter feature drift, and it combines patch classification with offset refinement to localize correspondences precisely. The approach achieves state-of-the-art performance among online trackers and results that are superior to or competitive with offline methods on seven datasets, including the TAP-Vid benchmark. Extending the memory at inference time enables longer temporal coverage without sacrificing real-time operation. Ablations validate the contributions of the memory modules, patch ranking, and refinement components, supporting Track-On as a practical and scalable solution for real-time point tracking in diverse scenes.
Abstract
In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules -- spatial memory and context memory -- to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: https://kuis-ai.github.io/track_on
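The causal loop described above (classify the best-matching patch, refine with an offset, update a bounded memory) can be illustrated with a minimal sketch. This is not the Track-On implementation: the dot-product similarity, the averaging of the query with remembered features (a crude stand-in for attention over memory), and the names `classify_patch`, `patch_center`, `track_online`, and `offset_fn` are all assumptions made for illustration.

```python
from collections import deque


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify_patch(query, patch_feats):
    """Return the index of the patch most similar to the query (first on ties)."""
    scores = [dot(query, f) for f in patch_feats]
    return max(range(len(scores)), key=scores.__getitem__)


def patch_center(index, grid_w, stride):
    """Pixel coordinates (x, y) of the center of a patch in a row-major grid."""
    row, col = divmod(index, grid_w)
    return ((col + 0.5) * stride, (row + 0.5) * stride)


def track_online(query, frames, grid_w, stride, memory_size=4, offset_fn=None):
    """Track one point frame by frame, never looking at future frames.

    frames: iterable of per-frame patch feature lists (row-major grid).
    offset_fn: hypothetical placeholder for a learned offset head; it takes
    (query, matched_feature) and returns (dx, dy) in pixels.
    """
    memory = deque(maxlen=memory_size)  # bounded memory of past matched features
    track = []
    for patch_feats in frames:  # causal: only the current frame and the memory
        # Assumption: averaging stands in for attention over the memory.
        context = [query] + list(memory)
        q = [sum(v[i] for v in context) / len(context) for i in range(len(query))]
        idx = classify_patch(q, patch_feats)  # coarse patch-level match
        cx, cy = patch_center(idx, grid_w, stride)
        dx, dy = offset_fn(q, patch_feats[idx]) if offset_fn else (0.0, 0.0)
        track.append((cx + dx, cy + dy))  # refined sub-patch location
        memory.append(patch_feats[idx])  # oldest entry is dropped when full
    return track


# Toy usage: a 2x2 grid of 4-pixel patches with 2-D features.
frames = [
    [[1, 0], [0, 1], [0, 0], [0, 0]],  # point lies in the top-left patch
    [[0, 1], [0, 0], [0, 0], [1, 0]],  # point moved to the bottom-right patch
]
print(track_online([1, 0], frames, grid_w=2, stride=4))
```

In this toy setting the bounded `deque` mirrors the idea that memory length is a setting that can be changed at inference time; how the real model builds and reads its spatial and context memories is not specified in this section.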
