RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

Hao Li, Yuhao Wang, Wenning Hao, Pingping Zhang, Dong Wang, Huchuan Lu

TL;DR

This work introduces textual descriptions into RGBT tracking benchmarks and proposes RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking that achieves state-of-the-art performance across various challenging scenarios.

Abstract

RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling and, lacking language guidance, fail to adapt to appearance variations. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. We then propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. Specifically, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) module to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) that maintains a dynamic knowledge base and employs Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
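
To make the ATF idea in the abstract more concrete, the following is a minimal sketch of text-guided token selection followed by a channel exchange between the RGB and thermal branches. It is not the authors' implementation: the function names (`select_tokens`, `exchange_channels`), the cosine-similarity scoring, the `keep_ratio` and `threshold` parameters, and the per-channel gate scores are all assumptions chosen for illustration, and plain Python lists stand in for feature tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(tokens, text_embedding, keep_ratio=0.5):
    """Return indices of the search tokens most similar to the text embedding.

    Assumption: similarity to a single text embedding stands in for
    text-guided attention. Indices are returned in their original order.
    """
    scores = [cosine(t, text_embedding) for t in tokens]
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def exchange_channels(rgb_token, tir_token, rgb_gate, tir_gate, threshold=0.5):
    """Replace low-gate channels of one modality with the other modality's channels.

    Assumption: `rgb_gate` and `tir_gate` are hypothetical per-channel
    scores in [0, 1]; a channel whose gate falls below `threshold` is
    taken from the other modality.
    """
    new_rgb = [t if g < threshold else r
               for r, t, g in zip(rgb_token, tir_token, rgb_gate)]
    new_tir = [r if g < threshold else t
               for r, t, g in zip(rgb_token, tir_token, tir_gate)]
    return new_rgb, new_tir


# Usage: keep half of four 2-D tokens, then exchange channels of one token pair.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = select_tokens(tokens, text_embedding=[1.0, 0.0], keep_ratio=0.5)
rgb, tir = exchange_channels([1, 2, 3], [10, 20, 30],
                             rgb_gate=[0.9, 0.1, 0.9],
                             tir_gate=[0.2, 0.8, 0.8])
```

In this toy setup, discarding low-scoring tokens shrinks the search region that later stages must process, and the exchange step lets each modality borrow channels where its own signal is scored as weak. How the actual ATF computes its scores is not specified in the abstract.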

Paper Structure

This paper contains 18 sections, 15 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: Comparison with different RGBT tracking paradigms. (a) Existing RGBT trackers suffer from inadequate appearance modeling, search redundancy, and modality gap. (b) Our RAGTrack introduces linguistic reasoning, dynamic token selection, and adaptive channel exchange.
  • Figure 2: Overall framework. Our method begins by tokenizing input texts and images with reasoning tokens. MTE then performs unified visual-language modeling, while ATF utilizes text-guided attention to dynamically select target-relevant tokens and enables adaptive channel exchange. Subsequently, CRM retrieves relevant contexts from a dynamic knowledge base for context-aware reasoning. Finally, the prediction head outputs tracking results, which are used by MLLMs to generate updated textual descriptions for following frames.
  • Figure 3: Details of our proposed ATF.
  • Figure 4: Attribute-based evaluations on the LasHeR dataset.
  • Figure 5: Comparison with different hyper-parameters.
  • ...and 5 more figures
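
The retrieval step described in the Figure 2 caption (CRM retrieving relevant contexts from a dynamic knowledge base) can be sketched as a small bounded memory with similarity search. This is an illustrative sketch only: the `KnowledgeBase` class, its first-in-first-out eviction, the `capacity` and `top_k` parameters, and the cosine-similarity ranking are assumptions, not details taken from the paper.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class KnowledgeBase:
    """Hypothetical bounded memory of (frame id, embedding, description) entries."""

    def __init__(self, capacity=4):
        # Assumption: the oldest entry is dropped once capacity is reached.
        self.entries = deque(maxlen=capacity)

    def add(self, frame_id, embedding, description):
        self.entries.append((frame_id, list(embedding), description))

    def retrieve(self, query, top_k=2):
        """Return the top_k (frame id, description) pairs most similar to the query."""
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[1], query),
                        reverse=True)
        return [(fid, desc) for fid, _, desc in ranked[:top_k]]


# Usage: store per-frame contexts, then query with the current target embedding.
kb = KnowledgeBase(capacity=3)
kb.add(1, [1.0, 0.0], "a white car on the road")
kb.add(2, [0.0, 1.0], "the car is partly occluded")
kb.add(3, [0.8, 0.6], "the car turns left")
kb.add(4, [1.0, 0.1], "the car at night")  # evicts frame 1
context = kb.retrieve([1.0, 0.0], top_k=2)
```

Here `context` holds the stored descriptions closest to the current query, which a downstream reasoning step could consume. The update policy and the form of the retrieved context in RAGTrack itself may differ from this sketch.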