DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

Xuchen Li; Xiaokun Feng; Shiyu Hu; Meiqi Wu; Dailing Zhang; Jing Zhang; Kaiqi Huang

TL;DR

DTLLM-VLT automatically generates extensive, multi-granularity text descriptions to enhance environmental diversity. It leverages an LLM to provide multi-granularity semantic information for the VLT task efficiently and from diverse perspectives, enabling fine-grained evaluation of multi-modal trackers.

Abstract

Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video to precisely track a specified object. By leveraging high-level semantic information, VLT guides object tracking and alleviates the constraints of relying on the visual modality alone. Nevertheless, most VLT benchmarks are annotated at a single granularity and lack a coherent semantic framework to provide scientific guidance. Moreover, coordinating human annotators to produce high-quality annotations is laborious and time-consuming. To address these challenges, we introduce DTLLM-VLT, which automatically generates extensive, multi-granularity text to enhance environmental diversity. (1) DTLLM-VLT generates scientific, multi-granularity text descriptions using a cohesive prompt framework. Its succinct and highly adaptable design allows seamless integration into various visual tracking benchmarks. (2) We select three prominent benchmarks to deploy our approach, covering short-term tracking, long-term tracking, and global instance tracking. We offer four granularity combinations for these benchmarks, considering the extent and density of semantic information, thereby showcasing the practicality and versatility of DTLLM-VLT. (3) We conduct comparative experiments on VLT benchmarks with different text granularities, evaluating and analyzing the impact of diverse text on tracking performance. In conclusion, this work leverages an LLM to provide multi-granularity semantic information for the VLT task efficiently and from diverse perspectives, enabling fine-grained evaluation of multi-modal trackers. In the future, we believe this work can be extended to more datasets to support the understanding of vision datasets.
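As a rough illustration of the "four granularity combinations" mentioned in the abstract, the sketch below enumerates them as a cross product of two axes. The axis names (`initial`/`dense` for how often text is produced, `concise`/`detailed` for how much it says), the helper names, and the 100-frame interval are assumptions made for illustration only; the abstract itself states only that the combinations consider the extent and density of semantic information.

```python
from dataclasses import dataclass
from itertools import product

# Assumed axis values; only "dense concise/detailed" appears in the figure captions.
FREQUENCIES = ("initial", "dense")
DETAIL_LEVELS = ("concise", "detailed")


@dataclass(frozen=True)
class Granularity:
    """One text granularity setting (hypothetical representation)."""
    frequency: str
    detail: str

    @property
    def name(self) -> str:
        return f"{self.frequency} {self.detail}"


def granularity_combinations() -> list[Granularity]:
    """Return all frequency x detail combinations (2 x 2 = 4)."""
    return [Granularity(f, d) for f, d in product(FREQUENCIES, DETAIL_LEVELS)]


def frames_to_describe(num_frames: int, frequency: str, interval: int = 100) -> list[int]:
    """Pick frame indices that would receive a text description.

    The interval default is a hypothetical parameter, not a value taken
    from the text above.
    """
    if frequency == "initial":
        return [0]  # describe the first frame only
    if frequency == "dense":
        return list(range(0, num_frames, interval))
    raise ValueError(f"unknown frequency: {frequency!r}")


if __name__ == "__main__":
    for g in granularity_combinations():
        print(g.name, frames_to_describe(250, g.frequency))
```

Under these assumptions, each video would carry four parallel sets of descriptions, so a tracker can be evaluated against each setting separately.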

Paper Structure (18 sections, 5 figures, 3 tables)

Introduction
Related Work
Single Object Tracking Benchmark
Visual Language Tracking Benchmark
Algorithms for Visual Language Tracking
Text Generation by LLM
Generation Strategy
DTLLM-VLT
Generation Analysis
Speed and Memory Usage
Experimental Results
Datasets and Evaluation Methods
Tracking Results
Testing Directly
Retraining and Testing Respectively
...and 3 more sections

Figures (5)

Figure 1: Examples of video content and semantic descriptions on the OTB99_Lang, LaSOT, and MGIT benchmarks. The green bounding box (BBox) indicates ground truth, while the red dashed BBox indicates other objects that satisfy the semantic description. (a) and (b) are short sequences in OTB99_Lang with simple narrative content. Moreover, their semantic annotations mainly describe the first frame, which may misguide the algorithm. (c) Comparison of text annotations, video length, and content across the three benchmarks. The VLT environment is complex and variable, and most benchmarks suffer from inconsistent text styles and a single annotation granularity.
Figure 2: Comparison of manual annotation and automatic generation, and the framework of DTLLM-VLT. (a) Manual annotation relies on human labor, provides only one text annotation for each video segment, and cannot guarantee a uniform style. The cost of large-scale annotation is too high. (b) Automatic generation can produce diverse text on a large scale in a unified style. (c) DTLLM-VLT can provide dense concise/detailed text generation based on the given video frames and the object's BBox.
Figure 3: The word cloud of semantic descriptions and word count statistics.
Figure 4: Examples of the four types of generated text. We provide four different natural language descriptions for each video. The object to be tracked is determined in the first frame and does not change throughout the video sequence.
Figure 5: Visualization of tracking results for the algorithm retrained on dense concise text annotations.
