Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues

X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang

TL;DR

This work tackles the challenge of leveraging textual cues in Vision-Language Tracking under data scarcity by introducing CTVLT, a plug-and-play framework that converts textual descriptions into visual heatmaps using a foundation grounding model (Grounding DINO). The textual cue mapping module produces a target distribution heatmap $H_l$, which is integrated with the search features through a heatmap guidance module to steer tracking. Extensive experiments on MGIT, TNL2K, and LaSOT demonstrate state-of-the-art performance and robust gains from the heatmap-based textual guidance, with ablations confirming the superiority of the refined heatmap over naive cues. While the approach improves cross-modal alignment and tracking accuracy, it adds computational overhead, which the authors suggest mitigating via asynchronous inference in future work.

Abstract

Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our approach, achieving state-of-the-art performance and validating the utility of our method for enhanced VLT.
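The two steps named in the abstract, mapping a textual cue to a target distribution heatmap and then using that heatmap to guide the search features, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the abstract does not specify how either module is built, so cosine similarity stands in for the grounding model's text-image alignment, a residual re-weighting stands in for the heatmap guidance module, and the names `text_to_heatmap`, `guide_with_heatmap`, and `alpha` are hypothetical.

```python
import numpy as np


def text_to_heatmap(search_feats, text_feat):
    """Hypothetical stand-in for a textual cue mapping step.

    search_feats: (H, W, C) search-image features.
    text_feat:    (C,) text embedding in the same feature space (an assumption).
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Cosine similarity between the text embedding and every spatial location.
    s = search_feats / (np.linalg.norm(search_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    sim = s @ t  # shape (H, W)
    # Min-max normalisation so the map reads as a relative target likelihood.
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + 1e-8)


def guide_with_heatmap(search_feats, heatmap, alpha=1.0):
    """Hypothetical stand-in for heatmap guidance: residual re-weighting.

    Features at locations with a high heatmap value are amplified; locations
    with a zero heatmap value are left unchanged.
    """
    return search_feats * (1.0 + alpha * heatmap[..., None])


rng = np.random.default_rng(0)
search_feats = rng.normal(size=(16, 16, 32))
# Pretend the text embedding matches the features at location (5, 9).
text_feat = search_feats[5, 9].copy()

heatmap = text_to_heatmap(search_feats, text_feat)
guided = guide_with_heatmap(search_feats, heatmap)
peak = tuple(int(i) for i in np.unravel_index(heatmap.argmax(), heatmap.shape))
```

Because the guidance step only needs an (H, W) map, a sketch like this leaves the tracker's own encoder and prediction head untouched, which is the sense in which such a component can be plug-and-play.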

Paper Structure (15 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Schematic diagram of motivation and method paradigm innovation. (a): Comparison of training environments between vision-language trackers and foundation grounding models. (b): The severe scarcity of textual data limits the tracker’s ability to understand text, making direct use of textual cues for guidance challenging. (c): Our core insight is to leverage the strong text-image alignment capabilities of foundation grounding models by first converting textual cues into visual cues that the tracker can easily interpret, and then using them to guide the tracker.
  • Figure 2: Framework of our vision-language tracker (CTVLT). Our proposed textual cue utilization method consists of the encoder module from a foundation grounding model, along with our designed textual cue mapping module and heatmap guidance module. As a plug-and-play method, it can be seamlessly integrated between the encoder and prediction modules of a visual tracker, transforming it into a vision-language tracker.
  • Figure 3: Visualization of different attention maps for search features with respect to textual features. (a): Given search image and textual cue. (b)-(e): Attention maps at different scales obtained through a naive process. (f): Refined result after applying our proposed method.
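
To make the multi-scale maps in Figure 3 concrete, here is a rough sketch of how text-to-search attention maps at several resolutions might be computed and brought to a common size. The caption does not say how the naive maps are obtained or how the refined map is produced, so the choices below are assumptions: a softmax over scaled dot-product scores for each map, nearest-neighbour upsampling, and a plain average as a placeholder for refinement (not the paper's method). The helper names and the four resolutions are hypothetical.

```python
import numpy as np


def attention_map(feats, text_feat):
    """Softmax over spatial positions of scaled dot-product scores.

    feats: (H, W, C) features at one scale; text_feat: (C,) text embedding.
    Returns an (H, W) map whose entries sum to 1.
    """
    h, w, c = feats.shape
    scores = feats.reshape(-1, c) @ text_feat / np.sqrt(c)
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return (weights / weights.sum()).reshape(h, w)


def upsample_nearest(m, size):
    """Nearest-neighbour upsampling of a square map to (size, size).

    Assumes size is an integer multiple of the map's side length.
    """
    factor = size // m.shape[0]
    return np.repeat(np.repeat(m, factor, axis=0), factor, axis=1)


rng = np.random.default_rng(0)
text_feat = rng.normal(size=32)

# Four hypothetical feature-map resolutions, loosely mirroring panels (b)-(e).
scales = [4, 8, 16, 32]
maps = [attention_map(rng.normal(size=(s, s, 32)), text_feat) for s in scales]

# Bring every map to the finest resolution and renormalise each to sum to 1.
upsampled = [upsample_nearest(m, 32) for m in maps]
upsampled = [u / u.sum() for u in upsampled]

# Placeholder for refinement: a plain average of the rescaled maps.
fused = np.mean(upsampled, axis=0)
```

Coarse maps localize only to large blocks while fine maps tend to be noisier, which is one reason a single naive map can be a weak cue on its own.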