ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe

Yifan Bai, Zeyang Zhao, Yihong Gong, Xing Wei

TL;DR

ARTrackV2 tackles robust visual tracking by jointly modeling where to look and what the target looks like over time. It introduces a unified generative framework that evolves both trajectory and appearance with a pure Transformer encoder, using appearance prompts and a masking strategy to reconstruct the target's appearance across frames. Trained end-to-end on video sequences with sequence-level losses and an MAE-inspired reconstruction objective, ARTrackV2 reaches an AO of 79.5\% on GOT-10k and an AUC of 86.1\% on TrackingNet while running $3.6 \times$ faster than ARTrack. The approach highlights the value of time-continuous, joint trajectory-appearance modeling for accurate and efficient tracking, with potential applicability to broader video understanding tasks.

Abstract

We present ARTrackV2, which integrates two pivotal aspects of tracking: determining where to look (localization) and how to describe (appearance analysis) the target object across video frames. Building on the foundation of its predecessor, ARTrackV2 extends the concept by introducing a unified generative framework to "read out" the object's trajectory and "retell" its appearance in an autoregressive manner. This approach fosters a time-continuous methodology that models the joint evolution of motion and visual features, guided by previous estimates. Furthermore, ARTrackV2 stands out for its efficiency and simplicity, obviating the less efficient intra-frame autoregression and hand-tuned parameters for appearance updates. Despite its simplicity, ARTrackV2 achieves state-of-the-art performance on prevailing benchmark datasets while markedly improving efficiency. In particular, ARTrackV2 achieves an AO score of 79.5\% on GOT-10k and an AUC of 86.1\% on TrackingNet while being $3.6 \times$ faster than ARTrack. The code will be released.

Paper Structure (20 sections, 3 equations, 9 figures, 9 tables)

This paper contains 20 sections, 3 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Frameworks and performance comparison of trackers following the sequence generation paradigm. (a) SeqTrack views tracking as sequence prediction. (b) ARTrack introduces trajectory evolution. (c) ARTrackV2 incorporates joint trajectory-appearance evolution. (d) Performance comparison.
  • Figure 2: ARTrackV2 framework. Initially, we utilize a Transformer encoder to process all tokens within a frame in parallel, with a masking strategy shown on the top right. Subsequently, appearance tokens are directed to a reconstruction decoder, where the object's appearance within the ongoing search region is reconstructed. Simultaneously, the confidence token is fed into an MLP to predict the IoU between the estimated and ground truth bounding boxes, serving as a measure of the quality of appearance tokens.
  • Figure 3: Comparison of the accuracy vs. latency trade-off for different tracking methods on GOT-10k (one-shot setting).
  • Figure 4: Comparison of appearance modeling approaches. (a) A discriminative model adopts a score-and-crop strategy to decide updates. (b) A generative model learns to reconstruct the template.
  • Figure 5: Attention visualization. (a): Search region and template. The red boxes denote the ground truth. (b)-(e): Appearance-token-to-search cross-attention maps of ARTrackV2.
  • ...and 4 more figures
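
The Figure 2 caption states that the confidence token is mapped by an MLP to a predicted IoU between the estimated and ground truth bounding boxes. As a rough illustration of that regression target only, the sketch below computes IoU for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the name `box_iou` are assumptions made for this example; the paper's actual box parameterization and the MLP itself are not reproduced here.

```python
from typing import Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) corners, x2 >= x1, y2 >= y1.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
    print(box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

A value near 1 indicates that the estimated box closely matches the ground truth, which is why the caption treats the predicted IoU as a quality measure for the appearance tokens.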