DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

Xuchen Li, Shiyu Hu, Xiaokun Feng, Dailing Zhang, Meiqi Wu, Jing Zhang, Kaiqi Huang

TL;DR

This work utilizes LLMs to generate varied semantic annotations for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking.

Abstract

Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, (1) we propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer texts at four granularities in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance, and we hope that the performance bottlenecks identified in existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on http://videocube.aitestunion.com/.

Paper Structure (30 sections, 2 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Comparison of DTVLT and other VLT benchmarks. (a-c) Examples of video content and semantic descriptions on OTB99_Lang (otb99), LaSOT (lasot), and MGIT (mgit). The green bounding box (BBox) indicates the ground truth, while the red dashed BBox indicates other objects that satisfy the semantic description. (a) and (b) are sequences with simple narrative content, and their semantic annotations mainly describe the first frame, which may misguide the algorithm because of misleading text and multiple qualifying targets. In the VLT task, if the error caused by incorrect text accumulates in the tracker, it has an irreversible impact on the tracking results. (c) The text in MGIT is so complex that it is not conducive to algorithmic learning. (d) An example of the multi-granular generation strategy used by DTVLT. We provide more diverse concise and detailed descriptions for every hundred frames of the object to be tracked, covering five representative datasets across three mainstream tracking tasks. The term “# xx” represents the frame ID. Compared to existing benchmarks, the generated text provides richer and more flexible information to portray long videos.
  • Figure 2: The pipeline of text generation for DTVLT based on DTLLM-VLT, which can provide dense concise/detailed text generation based on the given video frames and the BBox of the object.
  • Figure 3: Examples of the text generation in DTVLT. We provide four different natural language descriptions for each video. Diverse multi-granularity text can support fine-grained evaluation of trackers, providing guidance for the development of tracking.
  • Figure 4: The word cloud of semantic descriptions.
  • Figure 5: Comparison of retraining for 50 epochs versus direct testing on DTVLT. We plot the performance differences between the retrained model and the directly tested model, where the red line represents the mean of these differences.
  • ...and 11 more figures
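
The dense annotation scheme described for Figure 1(d), one description for every hundred frames, can be illustrated with a minimal sketch. This is not the DTVLT toolkit: the function names, the 0-based frame indexing, and the placeholder texts are all assumptions made for illustration; only the hundred-frame interval comes from the caption above.

```python
from bisect import bisect_right

def annotation_frames(num_frames, interval=100):
    """Start frames of each description, assuming 0-based frame IDs
    and one new description every `interval` frames."""
    return list(range(0, num_frames, interval))

def active_description(frame_id, starts, texts):
    """Return the description whose start frame is the latest one
    that does not exceed `frame_id`."""
    if frame_id < starts[0]:
        raise ValueError("frame_id precedes the first annotated frame")
    idx = bisect_right(starts, frame_id) - 1
    return texts[idx]

# Hypothetical 250-frame video with placeholder texts.
starts = annotation_frames(250)
texts = ["text for #0", "text for #100", "text for #200"]
current = active_description(150, starts, texts)
```

A tracker evaluated with dense text would look up the description in effect at each frame in this way, whereas an initial-only setting would always use the first entry.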