Dynamic Updates for Language Adaptation in Visual-Language Tracking

Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song

TL;DR

DUTrack tackles semantic drift in vision-language tracking by dynamically updating multi-modal references, rather than relying on static initial descriptions. It introduces a Dynamic Template Capture Module (DTCM) and a Dynamic Language Update Module (DLUM) within a one-stream unified visual-language transformer (HiViT-based) to refresh visual templates and language descriptions in response to target motion and appearance changes. Through a two-stage training regimen and an update strategy guided by displacement, scale, and color changes, DUTrack achieves state-of-the-art results on four vision-language benchmarks and strong performance on two vision-only benchmarks, demonstrating improved robustness and alignment between reference text and target state. The approach offers a practical path to more reliable VL tracking in dynamic scenes by maintaining cross-modal consistency with ongoing updates, leveraging both a learned description generator and dynamic visual templates.

Abstract

The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at https://github.com/GXNU-ZhongLab/DUTrack.
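The update strategy described above can be illustrated with a minimal sketch. It assumes the decision is a simple threshold test on how far the target has moved and how much its box area has changed since the reference was last refreshed. The `Box` type, the `should_update` function, and both threshold values are hypothetical names and numbers chosen for illustration; they are not taken from the DUTrack code, and the "other factors" mentioned in the abstract are not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box given by its centre and size, in pixels (hypothetical type)."""
    cx: float
    cy: float
    w: float
    h: float


def should_update(ref: Box, cur: Box,
                  disp_thresh: float = 0.5,
                  scale_thresh: float = 0.3) -> bool:
    """Decide whether the multi-modal reference should be refreshed.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Displacement of the box centre, normalised by the reference box diagonal
    # so the test does not depend on image resolution.
    diag = math.hypot(ref.w, ref.h)
    displacement = math.hypot(cur.cx - ref.cx, cur.cy - ref.cy) / diag

    # Relative change in box area since the reference was captured.
    ref_area = ref.w * ref.h
    scale_change = abs(cur.w * cur.h - ref_area) / ref_area

    return displacement > disp_thresh or scale_change > scale_thresh


if __name__ == "__main__":
    reference = Box(100, 100, 40, 40)
    print(should_update(reference, Box(102, 101, 41, 40)))  # small change
    print(should_update(reference, Box(150, 100, 40, 40)))  # large displacement
```

Gating the description generator this way keeps the cost of calling a large language model proportional to how often the target actually changes, which is the efficiency motivation the abstract gives for the update strategy.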

Paper Structure

This paper contains 14 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison of different VL tracking frameworks. (a) Existing vision-language tracking frameworks rely on static multi-modal references. (b) Our proposed VL framework dynamically updates the multi-modal references. (c) Comparison of the semantic bias between static annotations and the descriptions generated by our method.
  • Figure 2: Overall framework of the proposed DUTrack. The input consists of two parts: search frame and multi-modal reference. The image and text information are transformed into tokens through Patch Embedding and Tokenizer processing, respectively. Then, these tokens enter the multi-modal interaction module for unified interaction. The resulting multi-modal features are processed through the tracking head to produce the final output. Based on this result, it is determined whether to update the multi-modal reference. The dynamic multi-modal reference is primarily responsible for generating a new reference according to the object’s state in the current frame.
  • Figure 3: Illustration of the process of capturing dynamic templates. The left side shows the unified visual-language modeling generating a global attention map, while the right side captures dynamic templates based on the global attention map.
  • Figure 4: Attribute-based evaluation on the LaSOT test set. AUC score is used to rank different trackers.
  • Figure 5: Qualitative comparison of our tracker with two VL trackers (i.e., UVLTrack and MMTrack) and one vision-only tracker (OSTrack) on three challenging sequences from the LaSOT benchmark. Best viewed in color with zoom-in.
  • ...and 1 more figures
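
The dynamic template capture illustrated in Figure 3 can be approximated with a short sketch. It assumes the global attention map is a 2D grid of scores and that a template is taken as a fixed-size window centred on the highest-scoring cell. This peak-centred crop is a simplification chosen for illustration; the function name `peak_window` and the window-size parameter are hypothetical and do not come from the DUTrack code.

```python
from typing import List, Tuple


def peak_window(attn: List[List[float]], win: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a win x win window centred on the
    attention peak, shifted as needed so it stays inside the map.

    Bottom and right are exclusive, so the window is attn[top:bottom][left:right].
    """
    rows, cols = len(attn), len(attn[0])
    if win > rows or win > cols:
        raise ValueError("window must fit inside the attention map")

    # Locate the cell with the highest attention score.
    peak_r, peak_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: attn[rc[0]][rc[1]],
    )

    # Centre the window on the peak, then clamp it to the map borders.
    top = min(max(peak_r - win // 2, 0), rows - win)
    left = min(max(peak_c - win // 2, 0), cols - win)
    return top, left, top + win, left + win


if __name__ == "__main__":
    attention = [[0.0] * 4 for _ in range(4)]
    attention[1][2] = 1.0
    print(peak_window(attention, 2))
```

In a real tracker the window coordinates would be mapped from the token grid back to pixel coordinates before cropping the search frame; that mapping depends on the patch size and is left out here.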