
Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, Rongrong Ji

TL;DR

AQATrack addresses the challenge of robust visual tracking under appearance variations by introducing autoregressive target queries within a spatio-temporal transformer framework. The method combines a HiViT-based spatial encoder, a temporal decoder with autoregressive queries and temporal attention, and a parameter-free spatio-temporal fusion module (STM) to fuse static appearance with instantaneous changes. It achieves state-of-the-art or competitive results across six benchmarks (e.g., LaSOT, LaSOT_ext, GOT-10k, TNL2K, UAV123, TrackingNet) while maintaining real-time inference (~65 fps). This approach reduces reliance on hand-designed components and demonstrates the effectiveness of continuous spatio-temporal modeling for robust single-object tracking in diverse conditions.

Abstract

Rich spatio-temporal information is crucial for capturing complicated target appearance variations in visual tracking. However, most top-performing tracking algorithms rely on many hand-crafted components for spatio-temporal information aggregation. Consequently, the spatio-temporal information is far from being fully explored. To alleviate this issue, we propose an adaptive tracker with spatio-temporal transformers (named AQATrack), which adopts simple autoregressive queries to effectively learn spatio-temporal information without many hand-designed components. First, we introduce a set of learnable and autoregressive queries to capture the instantaneous target appearance changes in a sliding-window fashion. Then, we design a novel attention mechanism for the interaction of existing queries to generate a new query in the current frame. Finally, based on the initial target template and the learnt autoregressive queries, a spatio-temporal information fusion module (STM) is designed for spatio-temporal information aggregation to locate the target object. Benefiting from the STM, we can effectively combine the static appearance and instantaneous changes to guide robust tracking. Extensive experiments show that our method significantly improves the tracker's performance on six popular tracking benchmarks: LaSOT, LaSOT_ext, TrackingNet, GOT-10k, TNL2K, and UAV123.
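The three steps in the abstract (a sliding window of queries, attention over existing queries to form a new one, and fusion with spatial features) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The function names (`attend`, `next_query`, `fuse`, `track`), the window size, the single-head attention without learned projections, and the similarity-based reweighting used for fusion are all assumptions chosen for illustration. A random vector stands in for the trained learnable query.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, keys):
    # Single-head scaled dot-product attention; keys double as values.
    scores = query @ keys.T / np.sqrt(query.shape[-1])  # (1, n)
    return softmax(scores) @ keys                       # (1, d)


def next_query(learnable_query, past_queries, search_feats):
    # Hypothetical autoregressive step: the new query first attends to the
    # queries kept in the window, then to the current frame's features.
    q = learnable_query
    if past_queries:
        q = q + attend(q, np.concatenate(list(past_queries), axis=0))
    return q + attend(q, search_feats)


def fuse(search_feats, query):
    # Assumed parameter-free fusion: reweight each search token by its
    # dot-product similarity to the query. No learned weights are involved.
    sim = search_feats @ query.T  # (n, 1)
    return search_feats * sim     # (n, d)


def track(frames_feats, dim, window=4, seed=0):
    rng = np.random.default_rng(seed)
    learnable_query = rng.standard_normal((1, dim))  # stand-in for a trained query
    past = deque(maxlen=window)                      # sliding window of queries
    fused = []
    for feats in frames_feats:
        q = next_query(learnable_query, past, feats)
        fused.append(fuse(feats, q))
        past.append(q)  # the oldest query drops out once the window is full
    return fused, past
```

In this sketch the `deque` with `maxlen` plays the role of the sliding window: each frame's query is conditioned on at most `window` earlier queries, so the propagated temporal context stays bounded regardless of video length.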

Paper Structure (13 sections, 5 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 2: Overview of our framework. It mainly consists of four components: a spatial encoder for spatial features, a temporal decoder for learning an autoregressive target query that incorporates temporal information (with red arrows), a spatio-temporal feature fusion module (STM) that produces a spatio-temporal feature, and a prediction head.
  • Figure 3: The structure of the temporal decoder, which is equipped with a target query and temporal attention. Here FFN, MHA, and TA denote feed-forward network, multi-head attention, and temporal attention, respectively, and $Q$ represents the target query.
  • Figure 4: AUC scores for different attributes on LaSOT. Best viewed in color.
  • Figure 5: Success plots of one-pass evaluation (OPE) for the camera motion and motion blur challenges on LaSOT. Best viewed in color and zoomed in.
  • Figure 6: Comparison of tracking results with three other SOTA trackers on the LaSOT benchmark.
  • ...and 1 more figure