BTMTrack: Robust RGB-T Tracking via Dual-template Bridging and Temporal-Modal Candidate Elimination

Zhongxuan Zhang, Bi Zeng, Xinyu Ni, Yimin Du

TL;DR

BTMTrack targets robust RGB-T tracking under challenging conditions by combining temporal information and cross-modal fusion in a dual-template ViT backbone. The proposed Temporal-Modal Candidate Elimination (TMCE) strategy scores search-region tokens on temporal and cross-modal cues jointly, keeps the target-relevant tokens, and prunes the rest, which reduces both background noise and computational cost. The Temporal Dual Template Bridging (TDTB) module then strengthens cross-modal interaction by using the static and dynamic templates as a bridge between the RGB and TIR search regions. The method reports state-of-the-art results on the LasHeR test set (72.3% precision) and competitive results on RGBT210 and RGBT234, with ablations attributing the gains to TMCE and TDTB.

Abstract

RGB-T tracking leverages the complementary strengths of the RGB and thermal infrared (TIR) modalities to address challenging scenarios such as low illumination and adverse weather. However, existing methods often fail to integrate temporal information effectively or to perform efficient cross-modal interaction, which constrains their adaptability to dynamic targets. In this paper, we propose BTMTrack, a novel framework for RGB-T tracking. The core of our approach lies in the dual-template backbone network and the Temporal-Modal Candidate Elimination (TMCE) strategy. The dual-template backbone integrates temporal information, while the TMCE strategy focuses the model on target-relevant tokens by evaluating temporal and modal correlations, reducing computational overhead and suppressing irrelevant background noise. Building on this foundation, we propose the Temporal Dual Template Bridging (TDTB) module, which facilitates precise cross-modal fusion through dynamically filtered tokens and further strengthens the interaction between the templates and the search region. Extensive experiments on three benchmark datasets demonstrate the effectiveness of BTMTrack. Our method achieves state-of-the-art performance, with a 72.3% precision rate on the LasHeR test set and competitive results on the RGBT210 and RGBT234 datasets.
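The token-filtering idea behind TMCE can be illustrated with a small sketch. It assumes that each search-region token already has one relevance score per template and modality (for example, an attention weight from the static or dynamic template in the RGB or TIR branch), that the scores are combined by a plain average, and that a fixed fraction of tokens is kept. The function name `eliminate_candidates`, the `keep_ratio` parameter, and the averaging rule are illustrative assumptions, not the paper's exact formulation.

```python
import math


def eliminate_candidates(scores_by_source, keep_ratio=0.5):
    """Keep the search tokens with the highest combined relevance.

    scores_by_source maps a source name (hypothetical, e.g. "rgb_static")
    to a list holding one relevance score per search token.
    Returns (kept_indices, combined_scores); kept indices stay in their
    original spatial order.
    """
    sources = list(scores_by_source.values())
    if not sources:
        raise ValueError("at least one score source is required")
    n_tokens = len(sources[0])
    if any(len(s) != n_tokens for s in sources):
        raise ValueError("all sources must score the same tokens")

    # Assumption: temporal and modal cues are merged by simple averaging.
    combined = [
        sum(s[i] for s in sources) / len(sources) for i in range(n_tokens)
    ]

    # Rank tokens by combined score and keep the top fraction.
    n_keep = max(1, math.ceil(keep_ratio * n_tokens))
    ranked = sorted(range(n_tokens), key=lambda i: combined[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, combined


# Toy example: six search tokens scored by two templates in two modalities.
scores = {
    "rgb_static": [0.1, 0.8, 0.7, 0.2, 0.1, 0.05],
    "rgb_dynamic": [0.1, 0.9, 0.6, 0.1, 0.2, 0.05],
    "tir_static": [0.2, 0.7, 0.8, 0.1, 0.1, 0.1],
    "tir_dynamic": [0.1, 0.8, 0.9, 0.3, 0.1, 0.0],
}
kept, combined = eliminate_candidates(scores, keep_ratio=0.5)
print(kept)
```

Only the kept tokens would be passed to later layers, which is where the reduction in computation and background noise described above comes from.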

Paper Structure (15 sections, 18 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of our cross-modal fusion approach with previous methods. (a) VIPT injects TIR modality information into the RGB modality network as a prompt-based auxiliary input. (b) TBSI uses template tokens as a bridge to mediate interactions between the search regions of the two modalities. (c) Our model filters target-relevant search region tokens before performing dual-temporal template bridging. (d) Example of a Transformer block.
  • Figure 2: The overall framework of our method. It integrates static and dynamic templates with search regions from RGB and TIR patches. These patches are tokenized and processed by a ViT backbone for feature extraction. The proposed TMCE strategy filters tokens based on temporal and modal relevance to reduce background noise. The TDTB module enables interactions between dual-temporal templates and search regions of both modalities. Finally, fused RGB and TIR features are passed to the tracking head to predict the target's location.
  • Figure 3: The diagram demonstrates the process of dual-temporal template fusion and six MHCA operations within the TDTB module. For clearer presentation, we omit details such as LN, MLP, and the residual connections typically performed in each Transformer block.
  • Figure 4: Qualitative comparison of our method with other RGB-T trackers on four representative sequences from the LasHeR dataset.
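
The template-bridging idea referred to in Figures 1 and 3 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the TDTB module itself: it uses single-head attention without learned projections, LN, or MLP, treats the static and dynamic templates as one concatenated token set, and runs only two bridging stages rather than the six MHCA operations of Figure 3, whose exact arrangement is not given in this summary. The names `cross_attend` and `bridge` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head cross-attention with a residual connection.

    Simplification: keys and values are the raw context tokens,
    with no learned projection matrices.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(logits)
        attended = [
            sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def bridge(static_tpl, dynamic_tpl, search_rgb, search_tir):
    """Use template tokens as a bridge between the two search regions."""
    # Assumption: the dual templates are merged by concatenation.
    templates = static_tpl + dynamic_tpl
    # Stage 1: templates gather evidence from both modalities.
    templates = cross_attend(templates, search_rgb + search_tir)
    # Stage 2: each search region reads the fused template information back.
    return cross_attend(search_rgb, templates), cross_attend(search_tir, templates)


# Toy tokens with two feature dimensions each.
static_tpl = [[1.0, 0.0]]
dynamic_tpl = [[0.0, 1.0]]
search_rgb = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
search_tir = [[0.0, 1.0], [1.0, 0.0]]
fused_rgb, fused_tir = bridge(static_tpl, dynamic_tpl, search_rgb, search_tir)
print(len(fused_rgb), len(fused_tir))
```

Because the search regions never attend to each other directly, the cost of the exchange grows with the number of template tokens rather than with the product of the two search-region sizes, which is the usual motivation for a bridging design.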