TAPTRv2: Attention-based Position Update Improves Tracking Any Point

Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Feng Li, Tianhe Ren, Bohan Li, Lei Zhang

TL;DR

TAPTRv2 addresses fine-grained Tracking Any Point by removing cost-volume contamination from point queries and replacing it with an Attention-based Position Update (APU) that leverages key-aware deformable attention. By computing position updates from local attention weights and sampling offsets, the method keeps content features clean for accurate visibility prediction while still exploiting correlation information. The approach remains DETR-like with point queries, enabling a simpler, faster pipeline that achieves state-of-the-art AJ on TAP-Vid-DAVIS and TAP-Vid-Kinetics, and shows robustness across real and synthetic data. These results highlight the practical impact of disentangling position updates from content features and leveraging attention weights as a stand-in for cost-volume in robust, scalable TAP. In TAPTRv2, the position update for each query is governed by $\text{SoftMax}(\text{Disentangler}(A_t^i / \sqrt{d})) \cdot S_t^i$, where $A_t^i$ are attention weights and $S_t^i$ are sampling offsets, enabling accurate localization while preserving clean content features for visibility prediction. The method uses key-aware deformable attention to compute robust attention weights, and a dedicated disentangler to separate content and position updates, which mitigates domain gaps and improves generalization. Empirical results on TAP-Vid-DAVIS and TAP-Vid-Kinetics establish new benchmarks in accuracy and efficiency, reinforcing TAPTRv2 as a practical, scalable solution for tracking any point in videos.
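To make the position-update formula concrete, here is a minimal pure-Python sketch for a single point query. It is an illustration under stated assumptions, not the authors' implementation: `disentangler` is a hypothetical elementwise affine stand-in for the learned module, $A_t^i$ is taken to be the raw query-key dot products, and adding the weighted offset to the current position is assumed. All function and parameter names are invented for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def disentangler(scores, scale=1.0, shift=0.0):
    """Hypothetical stand-in for the learned disentangler (assumption):
    an elementwise affine map applied to the scaled scores."""
    return [scale * s + shift for s in scores]


def attention_position_update(query, keys, offsets, position):
    """Return (new_position, position_weights) for one point query.

    query:    content feature of the point query, length d
    keys:     features sampled at the deformable sampling positions
    offsets:  (dx, dy) sampling offsets S, one per key
    position: current (x, y) of the point query
    """
    d = len(query)
    # A / sqrt(d): scaled dot product between the query and each sampled key.
    scaled = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # SoftMax(Disentangler(A / sqrt(d))): weights used only for the position.
    weights = softmax(disentangler(scaled))
    # Weighted combination of the sampling offsets gives the position update.
    dx = sum(w * off[0] for w, off in zip(weights, offsets))
    dy = sum(w * off[1] for w, off in zip(weights, offsets))
    # Assumption: the update is added to the current position.
    return (position[0] + dx, position[1] + dy), weights
```

When every sampled key matches the query equally well, the weights are uniform and the update is the mean of the sampling offsets. When one key dominates, the query moves almost entirely toward that key's sampling position. The content feature `query` is read but never modified, which is the property the TL;DR attributes to APU.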

Abstract

In this paper, we present TAPTRv2, a Transformer-based approach built upon TAPTR for solving the Tracking Any Point (TAP) task. TAPTR borrows designs from DEtection TRansformer (DETR) and formulates each tracking point as a point query, making it possible to leverage well-studied operations in DETR-like algorithms. TAPTRv2 improves TAPTR by addressing a critical issue regarding its reliance on cost-volume, which contaminates the point query's content feature and negatively impacts both visibility prediction and cost-volume computation. In TAPTRv2, we propose a novel attention-based position update (APU) operation and use key-aware deformable attention to realize it. For each query, this operation uses key-aware attention weights to combine their corresponding deformable sampling positions to predict a new query position. This design is based on the observation that local attention is essentially the same as cost-volume, both of which are computed by dot products between a query and its surrounding features. By introducing this new operation, TAPTRv2 not only removes the extra burden of cost-volume computation, but also leads to a substantial performance improvement. TAPTRv2 surpasses TAPTR and achieves state-of-the-art performance on many challenging datasets, demonstrating the superiority
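The observation that local attention and cost-volume are both built from query-feature dot products can be illustrated with a small pure-Python sketch. It assumes a square local window and uses the raw features as keys with no learned projections, which is a simplification; the function names are hypothetical and not taken from the paper.

```python
import math


def local_cost_volume(query, feature_map, center, radius):
    """Dot product between the query and every feature in a square window.

    feature_map is a nested list indexed as feature_map[y][x] -> feature vector.
    The window is clipped at the borders of the feature map.
    """
    cy, cx = center
    height, width = len(feature_map), len(feature_map[0])
    costs = {}
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            costs[(y, x)] = sum(q * f for q, f in zip(query, feature_map[y][x]))
    return costs


def local_attention_weights(query, feature_map, center, radius):
    """Softmax over the same dot products, scaled by sqrt(d)."""
    costs = local_cost_volume(query, feature_map, center, radius)
    scale = math.sqrt(len(query))
    peak = max(costs.values())
    # Subtracting the peak keeps the exponentials numerically stable.
    exps = {pos: math.exp((c - peak) / scale) for pos, c in costs.items()}
    total = sum(exps.values())
    return {pos: e / total for pos, e in exps.items()}
```

In this simplified setting the attention weights are a normalized view of the cost-volume entries, so the location with the highest correlation is also the location with the highest attention weight.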

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Comparison of the frameworks among previous works, TAPTR, and TAPTRv2. Inspired by DETR-based detection algorithms, TAPTR formulates the point tracking problem as a detection problem and simplifies the overall pipeline to a well-studied DETR-like framework. After introducing the attention-based position update operation into Transformer decoder layers, the overall pipeline is further simplified to be as straightforward as detection methods. The operations within dashed boxes are executed only once.
  • Figure 2: The overview of TAPTRv2. The image feature preparation part and the point query preparation part prepare the image features of each frame of an input video and the point queries for each tracking point in every frame. The target point detection part takes the prepared image features and point queries as input. For every frame, each point query aims to predict the position and visibility of its target point.
  • Figure 3: Comparison of the decoder layer in TAPTR (a) and TAPTRv2 (b). In TAPTR (a), cost-volume aggregation contaminates the content feature, affecting cross-attention and leading to a contaminated cost-volume in the next layer. In TAPTRv2 (b), with the introduction of Attention-based Position Update (APU) in cross-attention, not only are the attention weights properly used to update the position of each point query and mitigate the domain gap, but the content feature of each point query is also kept uncontaminated, which is crucial for visibility prediction. We use an RGB image to represent the multi-scale feature maps for better visualization.
  • Figure 4: Visualization of the tracking results of TAPTRv2 in the wild. A user writes "house" on one frame and requires TAPTRv2 to track the points in the writing area. Best viewed in the electronic version.
  • Figure 5: The visualization of the attention weight distributions for feature and position updating in our cross attention.
  • ...and 2 more figures