TAPTR: Tracking Any Point with Transformers as Detection

Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang

TL;DR

TAPTR reframes Tracking Any Point (TAP) as a DETR-like problem, representing each tracked point as a dedicated query whose positional and content components are refined across $D$ decoder layers. By integrating a cost-volume-inspired signal, 2D deformable attention, and temporal self-attention within a sliding window of $W$ frames, it captures long-range motion while mitigating feature drift, yielding state-of-the-art TAP results with faster inference. The approach is validated on TAP-Vid, where it shows strong improvements over prior methods, and comprehensive ablations highlight the contribution of each design choice. This simple, scalable baseline opens avenues for leveraging detection and segmentation signals to further enhance TAP in real-world video analysis.

Abstract

In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, each tracking point in each video frame is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is updated layer by layer, and its visibility is predicted from its updated content feature. Queries belonging to the same tracking point can exchange information through self-attention along the temporal dimension. As all such operations are well established in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs, such as the cost volume from optical flow models, and develop simple designs to provide long temporal information while mitigating the feature drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.
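
The point-query idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The names `PointQuery`, `temporal_self_attention`, and `predict_visibility` are hypothetical, the attention uses no learned projections, and the visibility head is an assumed single linear layer followed by a sigmoid. It only shows how queries of one tracking point could exchange content along the temporal dimension and how a visibility score could be read from the content part.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PointQuery:
    """One tracking point in one frame: a positional part and a content part."""
    position: Tuple[float, float]  # (x, y) location estimate
    content: List[float]           # content feature vector


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(queries: List[PointQuery]) -> List[PointQuery]:
    """Mix content features of the same point across frames.

    Assumption: plain scaled dot-product attention with no learned
    query/key/value projections; positions are left untouched.
    """
    dim = len(queries[0].content)
    scale = math.sqrt(dim)
    mixed = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q.content, k.content)) / scale
            for k in queries
        ]
        weights = softmax(scores)
        new_content = [
            sum(w * k.content[i] for w, k in zip(weights, queries))
            for i in range(dim)
        ]
        mixed.append(PointQuery(q.position, new_content))
    return mixed


def predict_visibility(query: PointQuery, weight: List[float], bias: float) -> float:
    """Hypothetical visibility head: sigmoid of a linear map of the content part."""
    logit = sum(w * c for w, c in zip(weight, query.content)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# The same tracking point observed in three frames (toy values).
track = [
    PointQuery((10.0, 20.0), [1.0, 0.0]),
    PointQuery((11.0, 21.0), [0.0, 1.0]),
    PointQuery((12.0, 22.0), [1.0, 1.0]),
]
track = temporal_self_attention(track)
visibility = [predict_visibility(q, [0.5, -0.5], 0.0) for q in track]
```

In a real decoder this mixing step would be one of several operations repeated in every layer, alongside updates to the positional part, which the sketch omits.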

Paper Structure (25 sections, 10 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Comparison of our simple DETR-like framework, which models points in a clearly interpretable way, with the previous framework, which relies on redundant designs and black-box point modeling. The operation within the dashed box is executed only once.
  • Figure 2: Overview of TAPTR. The video preparation and query preparation parts provide the multi-scale feature maps, point queries, and cost volumes for the point decoder. The point decoder takes these elements as input and processes all frames in parallel. Its outputs are sent to our window post-processing module, which updates the states of the tracking points that the point queries belong to.
  • Figure 3: Overview of the sliding window, window updating, and padding. "F. Update" indicates the updating of the content feature, and "F. Padding" indicates the padding of the updated feature to the subsequent frames. We use window size 4 and sliding stride 2 for illustration.
  • Figure 4: Red and blue indicate visible and occluded points, respectively. We manually mark the ground-truth locations of invisible points with blue crosses for easier comparison. Best viewed in the electronic version.
  • Figure 5: The trajectory of handwriting predicted by TAPTR.
  • ...and 2 more figures
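
The window layout used for illustration in Figure 3 (window size 4, sliding stride 2) can be sketched as follows. The function name `sliding_windows` is hypothetical, and clamping the last window to the end of the video is an assumption, since the captions above do not say how the tail is handled. The sketch covers only the frame indices, not the feature update or padding steps.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int = 4, stride: int = 2) -> List[Tuple[int, int]]:
    """Return half-open frame ranges (start, end) covering the whole video.

    Assumption: the last window is clamped to the final frame rather than
    padded, so it may be shorter than `window`.
    """
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window and stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break  # the video is fully covered
        start += stride
    return windows


# Eight frames with the settings from the Figure 3 caption.
for start, end in sliding_windows(8, window=4, stride=2):
    print(f"frames {start}..{end - 1}")
```

With a stride smaller than the window size, consecutive windows overlap, which is what allows state from one window to be carried into the next.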