Explicit Visual Prompts for Visual Object Tracking

Liangtao Shi; Bineng Zhong; Qihua Liang; Ning Li; Shengping Zhang; Xianxian Li

TL;DR

EVPTrack addresses the when-and-how-to-update dilemma in visual tracking by propagating spatio-temporal information through tokens and generating explicit prompts that are fused with image tokens via a transformer encoder. The method introduces three components—Image-Prompt Encoder, Spatio-Temporal Encoder, and Prompt Generator—to exploit both spatio-temporal and multi-scale information without online template updating. Key contributions include the explicit visual prompts framework, a spatio-temporal propagation mechanism that avoids update strategies, and thorough ablations showing that multi-scale and spatio-temporal prompts improve robustness while maintaining real-time performance across six benchmarks. The approach demonstrates strong generalization and practical impact for real-time tracking in complex scenes, with code and models publicly available.

Abstract

How to effectively exploit spatio-temporal information is crucial to capture target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the when-and-how-to-update dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed EVPTrack. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we can not only alleviate the challenge of when-to-update, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing. Consequently, the efficiency of our model is improved by avoiding how-to-update. In addition, we consider multi-scale information as explicit visual prompts, providing multi-scale template features to enhance EVPTrack's ability to handle target scale changes. Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOT_ext, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at a real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
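The idea of feeding prompts into the encoder together with the image tokens, and carrying tokens from frame to frame instead of updating a template, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`self_attention`, `track_frame`), the token counts, the use of a single attention layer with identity projections, and the choice to reuse the fused prompt outputs as the next frame's prompts are all assumptions made for illustration only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (illustrative only)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        fused.append([sum(w * value[d] for w, value in zip(weights, tokens))
                      for d in range(dim)])
    return fused


def track_frame(template_tokens, search_tokens, prompt_tokens):
    """Hypothetical per-frame step: prompts are simply concatenated with the
    image tokens and fused in one joint attention pass."""
    n_t, n_s = len(template_tokens), len(search_tokens)
    fused = self_attention(template_tokens + search_tokens + prompt_tokens)
    fused_search = fused[n_t:n_t + n_s]   # would feed a box-prediction head
    next_prompts = fused[n_t + n_s:]      # assumption: carried to the next frame
    return fused_search, next_prompts


def random_tokens(count, dim, rng):
    """Stand-in for real patch-embedded image features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8
template = random_tokens(4, DIM, rng)   # fixed initial template, never updated
prompts = random_tokens(2, DIM, rng)    # hypothetical initial prompt tokens

for frame_idx in range(3):
    search = random_tokens(16, DIM, rng)
    fused_search, prompts = track_frame(template, search, prompts)
```

In this sketch the template tokens stay fixed across all frames and only the prompt tokens change, so no update schedule or update threshold has to be chosen. The actual EVPTrack uses separate Spatio-Temporal Encoder and Prompt Generator modules, which this toy loop collapses into a single attention pass.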

Paper Structure (14 sections, 4 equations, 6 figures, 6 tables)

This paper contains 14 sections, 4 equations, 6 figures, 6 tables.

Introduction
Related Work
Method
Overview
Image-Prompt Encoder
Spatio-Temporal Encoder
Prompt Generator
Training and Inference
Experiments
Implementation Details
Comparison with State-of-the-art Trackers
Ablation Study and Analysis
Conclusions
Acknowledgements

Figures (6)

Figure 1: Comparison of tracking frameworks. (a) The framework with an initial template (e.g., SiamFC, TransT). (b) The framework with a dynamic template (e.g., STARK, MixFormer). (c) Our EVPTrack framework uses tokens to propagate spatio-temporal information.
Figure 2: Overview of our framework. The input images are converted into tokens by patch embedding. Then, the Image-Prompt Encoder fuses features between the image tokens and the prompts. Finally, the fused search tokens are used to estimate the target state. In addition, the Spatio-Temporal Encoder propagates spatio-temporal information between consecutive frames, and the Prompt Generator generates the explicit visual prompts.
Figure 3: Illustration of how the Spatio-Temporal Encoder propagates temporal information.
Figure 4: (a) Illustration of the multi-scale prompt generator. (b) Illustration of the spatio-temporal prompt generator.
Figure 5: AUC scores of different attributes on LaSOT.
...and 1 more figure
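The Figure 2 caption mentions that input images are turned into tokens by patch embedding. As a rough illustration of that step, the sketch below splits a single-channel image into non-overlapping patches and flattens each one into a token. The function name `patchify`, the 4x4 toy image, and the patch size of 2 are assumptions; a real patch embedding would also apply a learned linear projection to each flattened patch, which is omitted here.

```python
def patchify(image, patch):
    """Split an H x W grid into non-overlapping patch x patch tokens.

    Tokens are returned in row-major patch order; each token is the
    flattened (row-major) content of one patch.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# Toy 4x4 "image" whose pixel value encodes its position (row * 4 + col).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, 2)
```

With a 4x4 input and a patch size of 2, this yields four tokens of four values each, one per 2x2 block.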
