Improving Visual Object Tracking through Visual Prompting

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

TL;DR

This work tackles generic visual object tracking by addressing the mismatch between category-level knowledge in foundation models and the need for instance-aware discrimination. It introduces PiVOT, a promptable tracking framework that uses a Prompt Generation Network to create initial visual prompts and a Relation Modeling module to refine features, guided online by CLIP at test time. Offline training employs discriminative and regression losses, while CLIP refinement during inference enhances robustness to unseen targets without retraining the backbone. Extensive experiments show PiVOT delivers strong, sometimes state-of-the-art, performance across diverse benchmarks, particularly for out-of-distribution targets, and demonstrates the practical value of image-based visual prompting for generic object tracking.

Abstract

Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pre-trained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.
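
The refinement step described in the abstract, where candidates are compared with the reference template and the prompt is adjusted accordingly, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine_similarity` and `refine_prompt` are hypothetical, the short vectors stand in for CLIP image embeddings (no CLIP model is called), and the reweighting rule (multiplying each candidate score by its clamped similarity) is a simple placeholder rather than the rule used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def refine_prompt(candidate_scores, candidate_embeddings, template_embedding):
    """Reweight candidate scores by similarity to the template embedding.

    Hypothetical rule: each score is multiplied by the cosine similarity
    between the candidate and the template, clamped at zero, so candidates
    that do not resemble the template are suppressed.
    """
    refined = []
    for score, embedding in zip(candidate_scores, candidate_embeddings):
        similarity = cosine_similarity(embedding, template_embedding)
        refined.append(score * max(similarity, 0.0))
    return refined


# Toy stand-ins for embeddings: the first candidate matches the template,
# the second is orthogonal to it (a distractor).
template = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]
initial_scores = [0.8, 0.9]

refined_scores = refine_prompt(initial_scores, candidates, template)
print(refined_scores)
```

In this toy case the distractor starts with the higher score but is suppressed after refinement, which is the qualitative behaviour the abstract attributes to the refined visual prompt.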

Paper Structure

This paper contains 15 sections, 6 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: The features of the current frame and the reference frames are provided to the Prompt Generation Network (PGN). The PGN collaborates with CLIP to enable automatic prompt generation and refinement. It generates an initial visual prompt that highlights the target candidates. The robust zero-shot recognition capability of CLIP for arbitrary objects allows it to effectively distinguish targets from distractors among the candidates. This capability is leveraged to refine the visual prompt. The Relation Modeling module processes the features of the current frame together with this visual prompt, generating an enhanced feature for the current frame. The Tracking Head processes the refined and reference frame features to formulate a prediction.
  • Figure 2: Overview of PiVOT. During the (a) training phase, we aim to make the tracker promptable by introducing the (c) Prompt Generation Network (PGN) and the (e) Relation Modeling (RM) module. The PGN learns to generate an initial prompt, and RM enables the tracker to be prompted through the visual prompt. The (f) Tracking Head predicts the resultant target state and coordinates. During the (b) inference phase, (d) Test-time Prompt Refinement (TPR) leverages CLIP to improve the visual prompt, as the zero-shot contrastive ability of CLIP enables it to handle arbitrary tracking objects. Through our proposed components, the visual prompt can be automatically generated and improved via CLIP without the need for human annotation throughout the sequence. In the RM example shown in the figure, a prompt that highlights the location of the target cattle suppresses distractors through the RM module.
  • Figure 3: Success plots of the proposed and competing methods on the NfS, LaSOT, and AVisT datasets, with AUC scores in the legend.
  • Figure 4: Attribute analysis on AVisT compares PiVOT with multiple trackers.
  • Figure 5: Prompting visualisation. For the current frame, the template is shown in the blue box, the visual prompt in the yellow box, the feature map in the red box, and its prompted version after applying RM in the green box. RM accentuates the area highlighted by the visual prompt. We apply color mapping to the feature map to enhance visualization.
  • ...and 4 more figures
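
The success plots and AUC scores mentioned in the Figure 3 caption follow a common tracking-benchmark convention, which the sketch below illustrates. It assumes boxes in `(x, y, width, height)` format, 21 evenly spaced overlap thresholds from 0 to 1, and a strict `>` comparison; individual benchmark toolkits may differ in these details, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def success_curve(ious, num_thresholds=21):
    """Fraction of frames whose IoU exceeds each overlap threshold."""
    if not ious:
        raise ValueError("at least one per-frame IoU is required")
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for v in ious if v > t) / len(ious) for t in thresholds]
    return thresholds, rates


def success_auc(ious, num_thresholds=21):
    """Area under the success curve, taken as the mean success rate."""
    _, rates = success_curve(ious, num_thresholds)
    return sum(rates) / len(rates)


# Toy example: predicted vs. ground-truth boxes for three frames.
predictions = [(0, 0, 2, 2), (1, 1, 2, 2), (5, 5, 2, 2)]
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
per_frame_iou = [iou(p, g) for p, g in zip(predictions, ground_truth)]
print(per_frame_iou, success_auc(per_frame_iou))
```

Each point on a success plot is one entry of `rates`, and the legend value is the mean of that curve, so a tracker that overlaps the target more tightly across more frames gets a higher AUC.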