Context-Aware Integration of Language and Visual References for Natural Language Tracking

Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen

TL;DR

This work addresses tracking by natural language specification (TNL) by proposing QueryNLT, a unified multimodal framework that integrates language and visual references in an end-to-end manner. It introduces a prompt modulation module to filter and align dynamic language cues with historical visual templates, and a Deformable-DETR–style target decoding module that retrieves the target from the search image using the multi-modal prompts jointly. The approach treats language-based and appearance-based matching as a single instance retrieval task, supported by a dynamic template memory, which improves both accuracy and temporal consistency. It achieves competitive or state-of-the-art results on TNL2K, OTB-Lang, LaSOT, and RefCOCOg.

Abstract

Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methods perform language-based and template-based matching for target reasoning separately and then merge the matching results from the two sources. They suffer from tracking drift when the language and visual templates misalign with the dynamic target state, and from ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module that leverages the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues, and 2) a unified target decoding module that integrates the multi-modal reference cues and executes the integrated queries on the search image to directly predict the target location in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and provides an integrated solution that generates predictions in a single step. Extensive experiments conducted on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed approach. The results demonstrate competitive performance against state-of-the-art methods for both tracking and grounding.
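
The single-step idea in the abstract can be illustrated with a toy example. The sketch below is not the QueryNLT implementation: it replaces the prompt modulation and Deformable-DETR–style decoding modules with mean pooling and one scaled dot-product attention step, and every function name and feature value is a hypothetical chosen for illustration. It only shows why a joint query built from both references can pick out the target where a language-only cue would be ambiguous.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_joint_query(language_tokens, template_tokens):
    """Fuse both references into one query vector.

    Assumption: plain mean pooling stands in for the paper's
    prompt modulation, which is not reproduced here.
    """
    prompts = language_tokens + template_tokens
    dim = len(prompts[0])
    return [sum(p[i] for p in prompts) / len(prompts) for i in range(dim)]

def retrieve_target(language_tokens, template_tokens, search_tokens):
    """Score every search location against the joint query in one pass."""
    query = build_joint_query(language_tokens, template_tokens)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, s) / scale for s in search_tokens])
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights

# Hypothetical 3-d features: [matches description, matches appearance, background]
language_tokens = [[1.0, 0.0, 0.0]]   # e.g. "white bird"
template_tokens = [[0.0, 1.0, 0.0]]   # appearance from a past frame
search_tokens = [
    [1.0, 0.0, 0.0],  # distractor: fits the description only
    [1.0, 1.0, 0.0],  # target: fits description and appearance
    [0.0, 0.0, 1.0],  # background
]

best_index, attention = retrieve_target(language_tokens, template_tokens, search_tokens)
print(best_index, [round(w, 3) for w in attention])
```

With these made-up features, the distractor and the target score equally under the language cue alone, while the joint query gives the target the highest attention weight without a separate merging stage.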

Paper Structure (14 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Given a video sequence, the tracking object is described as "white bird on the left" in the initial frame. Existing two-step approaches separately perform language-search matching (a) and appearance-search matching (b). However, the phrase "on the left", which is inconsistent with the current target state, and the background contained in the grounded target may both confuse target identification. In contrast, our QueryNLT (c) forms a dynamic and context-aware query for target localization by integrating visual and language references. (Zoom in for a better view.)
  • Figure 2: Overview of our proposed framework. It comprises three key components: a feature extraction module for extracting image and text features, a prompt modulation module that generates precise appearance and language descriptions of the target, and a target decoding module that jointly establishes the correlation between the search image and the multi-modal target prompts for target retrieval.
  • Figure 3: Architecture of the proposed language prompt modulation module (a) and the appearance modulation module (b).
  • Figure 4: Architecture of the proposed target decoding module.
  • Figure 5: Qualitative comparisons of the proposed QueryNLT with state-of-the-art trackers on three challenging sequences. Our QueryNLT can accurately localize the target even when objects suffer from severe appearance variations, background clutter, and similar distractors.
  • ...and 1 more figure