ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Yaozong Zheng; Bineng Zhong; Qihua Liang; Zhiyi Mo; Shengping Zhang; Xianxian Li

TL;DR

ODTrack reframes visual tracking as online token sequence propagation over video streams, addressing the limitations of offline, sparse image-pair methods. It introduces video-level sampling and two temporal token propagation attention mechanisms (concatenated and separated) to densely encode and pass target trajectory information across frames, serving as prompts for future inferences. Empirical results across seven benchmarks show state-of-the-art performance with real-time speed, validated by comprehensive ablations and visualizations that confirm effective cross-frame propagation and trajectory reasoning. The approach offers a simple yet powerful alternative to online template updates, with broad practical impact for robust, long-term tracking.

Abstract

Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each image-pair and establish limited temporal correlations. To alleviate the above problem, we propose a simple, flexible and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames in an online token propagation manner. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discrimination features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for the inference in the next video frame, whereby past information is leveraged to guide future inference; 2) the complex online update strategies are effectively avoided by the iterative propagation of token sequences, and thus we can achieve more efficient model representation and computation. ODTrack achieves a new SOTA performance on seven benchmarks, while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.
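
The token propagation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention layer with one head and identity projections, a zero-initialized temporal token, and random vectors standing in for image patch features. The names `self_attention`, `track_clip`, and `random_tokens` are hypothetical. It shows only the control flow: a temporal token attends jointly with the reference and search tokens, and its updated value is carried to the next frame as a prompt.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (illustration only)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def track_clip(reference, search_frames, dim):
    """Propagate one temporal token across search frames, frame by frame."""
    temporal_token = [0.0] * dim  # assumption: start from an empty (zero) token
    history = []
    for search in search_frames:
        # Concatenate [temporal token; reference tokens; search tokens].
        sequence = [temporal_token] + reference + search
        attended = self_attention(sequence)
        # The updated token at position 0 becomes the prompt for the next frame.
        temporal_token = attended[0]
        history.append(temporal_token)
    return history


def random_tokens(rng, count, dim):
    """Stand-in for patch embeddings; real features would come from a backbone."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 4
reference = random_tokens(rng, 3, DIM)
search_frames = [random_tokens(rng, 5, DIM) for _ in range(4)]
tokens_per_frame = track_clip(reference, search_frames, DIM)
```

In this sketch no template is re-cropped or re-encoded between frames; the only state carried forward is the temporal token, which is the sense in which propagation stands in for an explicit online update step.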

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, 5 tables.

Introduction
Related Work
Traditional Tracking Framework
Temporal Modelling in Visual Tracking
Approach
Question Formulation
Video-Level Tracking Pipeline
Video Sequence Sampling Strategy
Temporal Token Propagation Attention Mechanism
Discussions with Online Update
Prediction Head and Loss Function
Experiments
Implementation Details
Comparison with the SOTA
Ablation Study
...and 3 more sections

Figures (6)

Figure 1: Comparison of tracking methods. (a) The offline image-level tracking methods (SiamRPN++, TransT) based on sparse sampling and image-pair matching. (b) Our online video-level tracking method based on video sequence sampling and temporal token propagation.
Figure 2: ODTrack Framework Architecture. The ODTrack pipeline takes video clips, consisting of reference and search frames, of arbitrary length as input. Then, the model utilizes a temporal token propagation attention mechanism to generate a temporal token for each video frame. These temporal tokens are subsequently propagated to the following frames in an auto-regressive manner, enabling cross-frame propagation of target trajectory information.
Figure 3: Left: the architecture of temporal token propagation attention mechanism. Right: illustration of online token propagation. (a) Original reference-search attention mechanism, (b) and (c) Different variants of the proposed temporal token propagation attention mechanisms. $R$ is a single reference frame, $R_{1...k}$ denotes the reference frames of length $k$, $S$ represents the current search frame, and $T$ is the temporal token sequence of current video frames.
Figure 4: AUC scores of different attributes on LaSOT.
Figure 5: Qualitative comparison of our tracker with three other SOTA trackers on the LaSOT benchmark.
...and 1 more figures
