Explicit Visual Prompts for Visual Object Tracking

Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, Xianxian Li

TL;DR

EVPTrack addresses the when-and-how-to-update dilemma in visual tracking by propagating spatio-temporal information between consecutive frames through tokens, and by generating explicit visual prompts that are fed into a transformer encoder together with the image tokens. The method has three components (Image-Prompt Encoder, Spatio-Temporal Encoder, and Prompt Generator) that exploit both spatio-temporal and multi-scale information without a template updating strategy. Key contributions are the explicit visual prompts framework, a spatio-temporal propagation mechanism that avoids update strategies and their hyper-parameters, and multi-scale prompts that help the tracker handle target scale changes. On six benchmarks the tracker reports competitive performance at real-time speed, and code and models are publicly available.

Abstract

How to effectively exploit spatio-temporal information is crucial to capturing target appearance changes in visual tracking. However, most deep learning-based trackers focus mainly on designing a complicated appearance model or template updating strategy, while neglecting the context between consecutive frames, which entails the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing. Consequently, the efficiency of our model is improved by avoiding how-to-update. In addition, we consider multi-scale information as explicit visual prompts, providing multi-scale template features to enhance EVPTrack's ability to handle target scale changes. Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOT_ext, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at a real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
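
The token flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`self_attention`, `track_sequence`), the zero initialisation of the spatio-temporal tokens, the use of the carried tokens directly as prompts, and the single attention layer with identity projections are all assumptions made for illustration. The sketch only shows the two ideas stated above: prompts are concatenated with the image tokens before encoding, and the carried tokens are refreshed on every frame, so no update schedule or threshold is needed.

```python
import math


def self_attention(tokens):
    """Toy single-head self-attention with identity projections.

    Stands in for a transformer encoder layer; a real encoder would use
    learned query/key/value projections, multiple heads, and MLP blocks.
    """
    dim = len(tokens[0])
    fused = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        fused.append([sum(w * value[i] for w, value in zip(weights, tokens)) / total
                      for i in range(dim)])
    return fused


def track_sequence(template_tokens, frames, num_st_tokens=2):
    """Run the toy pipeline over a list of per-frame search-token lists.

    Returns the fused search tokens for each frame. `num_st_tokens` is a
    hypothetical parameter, not a value taken from the paper.
    """
    dim = len(template_tokens[0])
    # Assumption: spatio-temporal tokens start as zeros.
    st_tokens = [[0.0] * dim for _ in range(num_st_tokens)]
    outputs = []
    for search_tokens in frames:
        # Assumption: the carried tokens are used directly as the prompts.
        prompts = st_tokens
        # Prompts join the image tokens by plain concatenation.
        sequence = prompts + template_tokens + search_tokens
        fused = self_attention(sequence)
        num_prompts = len(prompts)
        num_template = len(template_tokens)
        outputs.append(fused[num_prompts + num_template:])
        # Carried tokens are refreshed on every frame, unconditionally.
        st_tokens = fused[:num_prompts]
    return outputs


if __name__ == "__main__":
    template = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    frame = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    results = track_sequence(template, [frame, frame])
    print(len(results), len(results[0]), len(results[0][0]))
```

In this sketch the output for a frame depends on earlier frames only through the carried tokens, which is the sense in which information propagates without a template being replaced.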

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Comparison of tracking frameworks. (a) The framework with an initial template (SiamFC, TransT). (b) The framework with a dynamic template (STARK, MixFormer). (c) Our EVPTrack framework uses tokens to propagate spatio-temporal information.
  • Figure 2: Overview of our framework. The input images are converted into tokens by patch embedding. The Image-Prompt Encoder then fuses features between the image tokens and the prompts. Finally, the fused search tokens are used to estimate the target state. In addition, the Spatio-Temporal Encoder propagates spatio-temporal information between consecutive frames, and the Prompt Generator generates explicit visual prompts.
  • Figure 3: Illustration of how the Spatio-Temporal Encoder propagates temporal information.
  • Figure 4: (a) Illustration of the multi-scale prompt generator. (b) Illustration of the spatio-temporal prompt generator.
  • Figure 5: AUC scores of different attributes on LaSOT.
  • ...and 1 more figure