RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation

Hao Li; Yuhao Wang; Wenning Hao; Pingping Zhang; Dong Wang; Huchuan Lu

TL;DR

This work introduces textual descriptions into RGBT tracking benchmarks and proposes RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking that achieves state-of-the-art performance across various challenging scenarios.

Abstract

RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling, failing to adapt to appearance variations due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce textual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) module to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) that maintains a dynamic knowledge base and employs Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios. The source code is available at https://github.com/IdolLab/RAGTrack.
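
The abstract describes ATF only at a high level: select target-relevant tokens, then exchange channels according to cross-modal correlations. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the cosine-similarity scoring against a text embedding, the per-channel Pearson correlation, and the rule "swap channels whose correlation falls below a threshold" are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(tokens, text_embedding, keep):
    """Assumed selection rule: keep the `keep` tokens most similar to the text
    embedding and return their indices in original order."""
    scores = [cosine(t, text_embedding) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def channel_correlation(rgb_tokens, tir_tokens, c):
    """Pearson correlation of channel `c` across paired RGB and thermal tokens."""
    xs = [t[c] for t in rgb_tokens]
    ys = [t[c] for t in tir_tokens]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def exchange_channels(rgb_tokens, tir_tokens, threshold):
    """Assumed exchange rule: swap a channel between the two modalities when its
    cross-modal correlation is below `threshold`."""
    dim = len(rgb_tokens[0])
    swap = [channel_correlation(rgb_tokens, tir_tokens, c) < threshold
            for c in range(dim)]
    new_rgb = [[t[c] if swap[c] else r[c] for c in range(dim)]
               for r, t in zip(rgb_tokens, tir_tokens)]
    new_tir = [[r[c] if swap[c] else t[c] for c in range(dim)]
               for r, t in zip(rgb_tokens, tir_tokens)]
    return new_rgb, new_tir, swap
```

In this sketch, token selection shrinks the search region to text-relevant tokens, and the exchange step mixes information only in channels where the two modalities disagree. The actual selection and exchange criteria are defined in the paper's Adaptive Token Fusion section and may differ.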

Paper Structure (18 sections, 15 equations, 10 figures, 8 tables)

Introduction
Related Work
RGB-Thermal Tracking
RGB-Language Tracking
Retrieval-Augmented Generation
Methodology
Overall Framework
Multi-modal Transformer Encoder
Adaptive Token Fusion
Context-aware Reasoning Module
Prediction Head and Loss Function
Experiment
Datasets and Evaluation Metrics
Implementation Details
Comparison with State-of-the-Art Trackers
...and 3 more sections

Figures (10)

Figure 1: Comparison with different RGBT tracking paradigms. (a) Existing RGBT trackers suffer from inadequate appearance modeling, search redundancy, and modality gap. (b) Our RAGTrack introduces linguistic reasoning, dynamic token selection, and adaptive channel exchange.
Figure 2: Overall framework. Our method begins by tokenizing input texts and images with reasoning tokens. MTE then performs unified visual-language modeling, while ATF utilizes text-guided attention to dynamically select target-relevant tokens and enables adaptive channel exchange. Subsequently, CRM retrieves relevant contexts from a dynamic knowledge base for context-aware reasoning. Finally, the prediction head outputs tracking results, which are used by MLLMs to generate updated textual descriptions for following frames.
Figure 3: Details of our proposed ATF.
Figure 4: Attribute-based evaluations on the LasHeR dataset.
Figure 5: Comparison with different hyper-parameters.
...and 5 more figures
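
The Figure 2 caption states that CRM retrieves relevant contexts from a dynamic knowledge base. The retrieval half of that step can be sketched as below, under stated assumptions: the class name `DynamicKnowledgeBase`, the fixed-capacity first-in-first-out update, and cosine-similarity ranking are hypothetical choices for illustration, and the generation step that consumes the retrieved contexts is not modeled.

```python
import math
from collections import deque


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class DynamicKnowledgeBase:
    """Hypothetical store of (frame index, embedding, description) entries.

    Assumption: a bounded queue that drops the oldest entry when full."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def add(self, frame_index, embedding, description):
        """Record the context observed at one frame."""
        self.entries.append((frame_index, list(embedding), description))

    def retrieve(self, query, top_k):
        """Return (frame index, description) for the `top_k` entries whose
        embeddings are most similar to `query`."""
        ranked = sorted(self.entries,
                        key=lambda e: _cosine(query, e[1]),
                        reverse=True)
        return [(e[0], e[2]) for e in ranked[:top_k]]


# Usage: store contexts as tracking proceeds, then query with the current frame.
kb = DynamicKnowledgeBase(capacity=2)
kb.add(0, [1.0, 0.0], "description at frame 0")
kb.add(1, [0.0, 1.0], "description at frame 1")
kb.add(2, [1.0, 0.1], "description at frame 2")  # evicts the frame 0 entry
best = kb.retrieve([1.0, 0.0], top_k=1)
```

A bounded queue keeps memory constant over long sequences at the cost of forgetting early frames. How the paper's knowledge base is actually updated and queried is described in its Context-aware Reasoning Module section.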
