TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, Ross Goroshin

TL;DR

TAPNext reframes Tracking Any Point as next-token prediction, using a shared spatio-temporal token bank and a causal recurrent backbone (SSM) coupled with ViT blocks to track points online without tracking-specific biases. It predicts point coordinates as distributions via a 256-bin classification head, enabling sub-pixel accuracy through expectation, and relies on masked decoding to impute missing tokens across frames. Trained on a large synthetic Kubric dataset and refined with BootsTAPNext on real data, TAPNext achieves state-of-the-art performance on TAP-Vid with minimal latency and demonstrates emergent, interpretable attention patterns. The work highlights a scalable, end-to-end approach to TAP that can extend to broader video understanding tasks while identifying avenues for improving long-horizon generalization.
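The coordinate readout described above, a classification over bins followed by an expectation, can be illustrated with a minimal NumPy sketch. The function name `expected_coordinate`, the evenly spaced bin centers, and taking the expectation over all bins (rather than, say, a local window around the peak) are assumptions made for illustration; the summary does not specify these details.

```python
import numpy as np

def expected_coordinate(logits, image_size):
    """Decode a coordinate as the expectation of a categorical distribution over bins.

    logits: array of shape (..., num_bins), e.g. num_bins = 256.
    image_size: extent of the axis in pixels (assumed to be covered evenly by the bins).
    """
    logits = np.asarray(logits, dtype=np.float64)
    num_bins = logits.shape[-1]
    # Numerically stable softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Assumption: bin k covers [k * w, (k + 1) * w) and is represented by its center.
    bin_width = image_size / num_bins
    centers = (np.arange(num_bins) + 0.5) * bin_width
    # The expectation blends neighbouring bins, which is what allows sub-pixel output.
    return (probs * centers).sum(axis=-1)

# Equal probability mass on bins 100 and 101 decodes to the point between their centers.
logits = np.full(256, -1e9)
logits[[100, 101]] = 0.0
x = expected_coordinate(logits, image_size=256)  # 101.0
```

Because the output is a weighted average of bin centers, the prediction is not restricted to the bin grid, even though the head itself is trained as a classifier.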

Abstract

Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training. The TAPNext model and code can be found at https://tap-next.github.io/.

Paper Structure

This paper contains 20 sections, 2 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Dense grid tracking with TAPNext. We show (a) the query points on the first frame of the video, (b) the resulting tracks on the final frame of the video for CoTracker3, and (c) our proposed TAPNext method.
  • Figure 2: TAPNext performs tracking via imputation of unknown point coordinates given known ones (query points and the video). This imputation happens via temporal masked decoding of tokens: video tokens are concatenated with point coordinate tokens and the latter inject point query information via positional encoding.
  • Figure 3: Three attention patterns learned by TAPNext. We visualize attention maps where the attention queries are the point track tokens and the keys are image tokens, which correspond to $8\times8$ patches. Each row corresponds to one (layer, head) pair. We observe three patterns: (top) cost-volume-like attention head; (middle) coordinate-based readout head; (bottom) motion-cluster-based readout head. Note that these are just intermediate heads in the backbone. A higher-resolution image and full attention maps are in Appendix \ref{app:attn_vis}.
  • Figure 4: Point-to-point attention map visualizations. Tracked points are nodes and (scaled) attention weights are edges; the thicker the edge, the higher the weight between points. Two frames from a video are used to visualize two attention layers. Note that in all images we see strong attention between points on objects that are moving together. See higher-resolution images in Appendix \ref{app:attn_vis}.
  • Figure 5: Video Completion by TAPNext variant. Left: Outputs of patch-level linear pixel heads. Right: Inputs to the model (Visible or masked image and points).
  • ...and 9 more figures
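
The token layout described in the Figure 2 caption (point tokens concatenated with video tokens, with only the query positions known) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the helper names `sincos_xy`, `build_point_tokens`, and `assemble_frame_tokens`, the sinusoidal form of the positional encoding, and the use of a single shared mask vector for unknown positions are all hypothetical.

```python
import numpy as np

def sincos_xy(x, y, dim):
    """Hypothetical sinusoidal encoding of an (x, y) position; dim must be divisible by 4."""
    freqs = 1.0 / (100.0 ** (np.arange(dim // 4) / (dim // 4)))
    return np.concatenate([np.sin(x * freqs), np.cos(x * freqs),
                           np.sin(y * freqs), np.cos(y * freqs)])

def build_point_tokens(num_frames, queries, dim, mask_token):
    """Build point tokens of shape [T, Q, dim] plus a boolean `known` mask of shape [T, Q].

    queries: list of (frame_index, x, y). A point's token carries its encoded query
    position on the query frame; every other frame holds the mask token, which the
    model is expected to impute.
    """
    num_points = len(queries)
    tokens = np.tile(mask_token, (num_frames, num_points, 1)).astype(np.float64)
    known = np.zeros((num_frames, num_points), dtype=bool)
    for q, (t, x, y) in enumerate(queries):
        tokens[t, q] = sincos_xy(x, y, dim)
        known[t, q] = True
    return tokens, known

def assemble_frame_tokens(image_tokens, point_tokens):
    """Concatenate per-frame image tokens [T, N, dim] with point tokens [T, Q, dim]."""
    return np.concatenate([image_tokens, point_tokens], axis=1)

# Toy example: 4 frames, 6 image patches per frame, 2 query points, 16-dim tokens.
T, N, dim = 4, 6, 16
image_tokens = np.random.default_rng(0).normal(size=(T, N, dim))
queries = [(0, 10.0, 20.0), (2, 5.0, 7.5)]
point_tokens, known = build_point_tokens(T, queries, dim, mask_token=np.zeros(dim))
sequence = assemble_frame_tokens(image_tokens, point_tokens)  # shape (4, 8, 16)
```

In this layout, tracking amounts to filling in the point tokens where `known` is False, frame by frame, from the image tokens and the known query tokens.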