Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang
TL;DR
This work addresses the difficulty of exploiting textual cues in Vision-Language Tracking (VLT) when text annotations are scarce relative to image data. It introduces CTVLT, a plug-and-play framework that uses a foundation grounding model (Grounding DINO) to convert textual descriptions into visual heatmaps. A textual cue mapping module produces a target distribution heatmap $H_l$, and a heatmap guidance module integrates it with the search features to steer tracking. Experiments on MGIT, TNL2K, and LaSOT show state-of-the-art performance, and ablations indicate that the refined heatmap outperforms naive cues. The approach improves cross-modal alignment and tracking accuracy at the cost of added computational overhead, which the authors suggest mitigating via asynchronous inference in future work.
Abstract
Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and a language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, a heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.
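The two-stage idea described above (text to heatmap, then heatmap-guided search features) can be pictured with a toy NumPy sketch. This is not the authors' implementation: the function names `text_to_heatmap` and `guide_search_features`, the cosine-similarity mapping, the multiplicative fusion, and the weight `alpha` are all assumptions chosen for illustration. In CTVLT the heatmap comes from a grounding model (Grounding DINO), and the fusion details are not given in this summary.

```python
import numpy as np


def text_to_heatmap(patch_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a textual cue mapping step.

    Scores each image patch by cosine similarity to a text embedding,
    then min-max scales the scores to [0, 1] to form a heatmap.
    patch_feats: (H, W, C) patch features; text_emb: (C,) text embedding.
    """
    eps = 1e-8
    patches = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    text = text_emb / (np.linalg.norm(text_emb) + eps)
    sim = patches @ text  # (H, W) cosine similarities
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + eps)


def guide_search_features(search_feats: np.ndarray, heatmap: np.ndarray,
                          alpha: float = 1.0) -> np.ndarray:
    """Hypothetical stand-in for a heatmap guidance step.

    Amplifies search features where the heatmap is high (assumed
    multiplicative fusion; alpha controls the strength).
    """
    return search_feats * (1.0 + alpha * heatmap[..., None])


# Toy data: random patch features with one patch aligned to the text embedding.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
text_emb = rng.normal(size=C)
patch_feats = rng.normal(size=(H, W, C))
patch_feats[2, 5] = 3.0 * text_emb  # planted "target" patch

heatmap = text_to_heatmap(patch_feats, text_emb)
guided = guide_search_features(patch_feats, heatmap)
peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

In this toy setup the heatmap peaks at the planted patch, and the guided features there are scaled up relative to the rest of the search region. A tracker consuming `guided` would thus receive a purely visual hint about where the described target lies, which is the intuition behind converting textual cues into visual ones.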
