Dynamic Updates for Language Adaptation in Visual-Language Tracking
Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song
TL;DR
DUTrack tackles semantic drift in vision-language (VL) tracking by dynamically updating its multi-modal references instead of relying on a static initial description. It introduces a Dynamic Template Capture Module (DTCM) and a Dynamic Language Update Module (DLUM) within a one-stream, HiViT-based vision-language transformer; together they refresh the visual template and the language description as the target moves and changes appearance. With a two-stage training regimen and an update strategy guided by changes in displacement, scale, and color, DUTrack achieves state-of-the-art results on four vision-language benchmarks and strong performance on two vision-only benchmarks. Keeping the reference text and the visual template consistent with the target's current state offers a practical path to more reliable VL tracking in dynamic scenes.
Abstract
The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at https://github.com/GXNU-ZhongLab/DUTrack.
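To make the update strategy concrete, the following is a minimal sketch of a displacement- and scale-based trigger, not the released implementation. The `Box` type, the function name `should_update`, the normalization by the reference box diagonal, the log-area scale measure, and both threshold values are assumptions chosen for illustration; the abstract states only that changes in target displacement, scale, and other factors decide when to update.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box as center coordinates plus width and height (pixels)."""
    cx: float
    cy: float
    w: float
    h: float


def should_update(ref: Box, cur: Box,
                  disp_thresh: float = 0.5,     # hypothetical threshold
                  scale_thresh: float = 0.3) -> bool:  # hypothetical threshold
    """Return True when the target has drifted enough from the reference state.

    `ref` is the box at the time the references were last refreshed,
    `cur` is the box predicted for the current frame.
    """
    if min(ref.w, ref.h, cur.w, cur.h) <= 0:
        raise ValueError("box width and height must be positive")

    # Center displacement, normalized by the reference box diagonal so the
    # test does not depend on image resolution (an assumed normalization).
    diag = math.hypot(ref.w, ref.h)
    displacement = math.hypot(cur.cx - ref.cx, cur.cy - ref.cy) / diag

    # Scale change as the absolute log of the area ratio, which treats
    # growing and shrinking symmetrically (an assumed measure).
    scale_change = abs(math.log((cur.w * cur.h) / (ref.w * ref.h)))

    return displacement > disp_thresh or scale_change > scale_thresh
```

In a tracking loop, a check of this kind would be evaluated once per frame, and the comparatively expensive description generation would run only when it returns `True`, after which the current box becomes the new reference.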
