HIPTrack: Visual Tracking with Historical Prompts

Wenrui Cai, Qingjie Liu, Yunhong Wang

TL;DR

HIPTrack addresses visual tracking under appearance variations by introducing a historical prompt network. The network encodes refined historical foreground masks and visual features of the target into a memory bank, then adaptively decodes prompts for the current search region. The tracker pairs a frozen Vision Transformer backbone with a lightweight encoder-decoder memory module, so it generates history-aware prompts without retraining the entire model. It achieves state-of-the-art results on LaSOT, LaSOT_ext, GOT-10k, and NfS while running efficiently. The historical prompts act as a plug-and-play enhancement that improves robustness to occlusion, deformation, and scale variation, and ablations confirm the importance of memory size, update cadence, and the quality of the encoded history. Overall, HIPTrack shows that precise, up-to-date historical information, accessed through prompt learning and memory retrieval, can substantially improve Siamese-style trackers in real-world scenarios.
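
The memory-bank idea in the summary above can be illustrated with a small sketch. This is not the authors' encoder or decoder, whose internals are not described on this page. It assumes a generic first-in-first-out bank of key/value feature vectors and a scaled dot-product attention read, and the names `HistoryMemory`, `write`, and `read` are hypothetical.

```python
import math
from collections import deque


class HistoryMemory:
    """Toy FIFO memory of (key, value) vectors; an illustrative assumption, not HIPTrack's code."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry once the bank is full.
        self.entries = deque(maxlen=capacity)

    def write(self, key, value):
        """Store one historical entry (e.g. features of a past frame)."""
        self.entries.append((list(key), list(value)))

    def read(self, query):
        """Return the softmax-weighted average of stored values (scaled dot-product attention)."""
        if not self.entries:
            raise ValueError("memory is empty")
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key, _ in self.entries]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.entries[0][1])
        return [sum(w * value[i] for w, (_, value) in zip(weights, self.entries)) / total
                for i in range(dim)]


memory = HistoryMemory(capacity=2)
memory.write([1.0, 0.0], [10.0, 0.0])
memory.write([0.0, 1.0], [0.0, 10.0])
# The query resembles the first key, so the first stored value dominates the prompt.
prompt = memory.read([8.0, 0.0])
```

In this sketch the bank capacity and the write schedule stand in for the memory size and update cadence that the ablations examine; the real module operates on feature maps and masks rather than short vectors.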

Abstract

Trackers that follow the Siamese paradigm use similarity matching between template and search region features for tracking. Many methods have been explored to enhance tracking performance by incorporating tracking history, so as to better handle target appearance variations such as deformation and occlusion. However, existing methods use historical information insufficiently and incompletely, and they typically require repeated training and introduce a large amount of computation. In this paper, we show that providing a Siamese-paradigm tracker with precise and updated historical information yields a significant performance improvement with completely unchanged parameters. Based on this, we propose a historical prompt network that uses refined historical foreground masks and historical visual features of the target to provide comprehensive and precise prompts for the tracker. We build a novel tracker called HIPTrack based on the historical prompt network, which achieves considerable performance improvements without the need to retrain the entire model. We conduct experiments on seven datasets, and the results demonstrate that our method surpasses the current state-of-the-art trackers on LaSOT, LaSOT_ext, GOT-10k and NfS. Furthermore, the historical prompt network can be integrated seamlessly as a plug-and-play module into existing trackers, providing performance enhancements. The source code is available at https://github.com/WenRuiCai/HIPTrack.

Paper Structure

This paper contains 26 sections, 2 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Visualized comparisons of our approach with two other strong trackers, GRM (Gao et al., 2023) and SeqTrack (Chen et al., 2023). Our method performs better when the target suffers from occlusion, deformation and scale variation.
  • Figure 2: (a) shows how tracker performance on LaSOT (Fan et al., 2019) varies as the template update interval changes. (b) shows how tracker performance on LaSOT varies as the crop factor of the current search region changes. A larger crop factor indicates coarser cropping. Each cross symbol marks the baseline of the method drawn in the same color. Note that TransT (Chen et al., 2021) does not have a fixed crop factor, so we use an average crop factor instead.
  • Figure 3: Overview of our proposed HIPTrack. The whole structure consists of a feature extraction network, a historical prompt network, and a head prediction network. The historical prompt network comprises a historical prompt encoder and a historical prompt decoder.
  • Figure 4: The structure of the historical prompt encoder and the historical prompt decoder. Zoom in for a clearer view.
  • Figure 5: The performance of our method compared with other state-of-the-art trackers in terms of AUC across various scenarios in the LaSOT test split.
  • ...and 5 more figures