Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Dengdi Sun, Yajie Pan, Andong Lu, Chenglong Li, Bin Luo

TL;DR

This work tackles robust RGBT tracking under appearance changes by integrating temporal dynamics without corrupting the original target template. It introduces STMT, a Spatio-Temporal Multimodal Tokens module that jointly fuses RGB and TIR templates with dynamic tokens and cross-modal enhancement, and that is inserted into a ViT-based tracker. A temporal training strategy enables end-to-end learning of temporal fusion within a single network, while multimodal dynamic tokens capture target variation across frames. Extensive experiments on RGBT210, RGBT234, and LasHeR demonstrate competitive or state-of-the-art performance at real-time speed, validating the approach’s effectiveness and practical value for multimodal tracking in challenging environments.

Abstract

Many RGBT tracking studies focus primarily on modal fusion design while overlooking the effective handling of target appearance changes. Although some approaches introduce historical frames, or fuse and replace the initial templates, to incorporate temporal information, they risk disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach that mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in the Transformer to handle target appearance changes for robust RGBT tracking. We introduce independent dynamic template tokens that interact with the search region and embed temporal information to address appearance changes. At the same time, the initial static template tokens remain involved in joint feature extraction, which preserves the original, reliable target appearance information and prevents the deviations from the target appearance caused by traditional temporal updates. We also use attention mechanisms to enhance the target features of the multimodal template tokens with supplementary modal cues, and we let the multimodal search region tokens interact with the multimodal dynamic template tokens via attention, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the Transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared with other state-of-the-art tracking algorithms while running at 39.1 FPS.
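
The two attention interactions named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify projections, head counts, or normalization, so the code assumes single-head scaled dot-product attention with no learned weights and a residual connection. The helper name cross_attend and all toy token values are hypothetical, and feeding the search tokens the concatenated RGB and TIR dynamic tokens is likewise an assumption.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention (assumed form).

    queries: list of N token vectors; context: list of M token vectors
    of the same dimension. No learned projections are used (assumption).
    Returns N vectors: each query plus its attention-weighted mix of the
    context tokens (residual connection, also an assumption).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[d] for w, k in zip(weights, context))
                 for d in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


# Toy 2-dimensional tokens (hypothetical values).
rgb_template = [[1.0, 0.0], [0.0, 1.0]]        # static RGB template tokens
tir_template = [[0.5, 0.5], [1.0, 1.0]]        # static TIR template tokens
rgb_search = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4]]
rgb_dynamic_prev = [[0.9, 0.1]]                # dynamic tokens from frame T-1
tir_dynamic_prev = [[0.8, 0.2]]

# (1) Modal enhancement: RGB template tokens query the TIR template tokens.
enhanced_rgb_template = cross_attend(rgb_template, tir_template)

# (2) Temporal interaction: search tokens query the previous frame's
#     dynamic tokens of both modalities (concatenation is an assumption).
updated_rgb_search = cross_attend(rgb_search,
                                  rgb_dynamic_prev + tir_dynamic_prev)
```

In this sketch the static template tokens are read but never overwritten, which mirrors the abstract's point that temporal information enters through separate dynamic tokens rather than by replacing the initial template.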

Paper Structure (22 sections, 17 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Comparison between our approach and previous ones. (a) A non-temporal method that focuses only on modality fusion. (b) Previous RGBT tracking approaches that introduce entire historical frames to incorporate temporal information. (c) A single-modal method that introduces temporal information by replacing the initial template. (d) Our approach, which mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in the Transformer to handle target appearance changes.
  • Figure 2: Overall framework of the proposed Spatio-Temporal Multimodal Tokens Transformer for RGBT tracking. RGB and TIR image patches are embedded as tokens and fed into Transformer blocks for joint feature extraction and intra-modal search-template matching. In the proposed module, T denotes the time of the current frame and T-1 the time of the previous frame. We first extract the search regions of both modalities and form them into dynamic tokens for the next time step. Then, we perform modality enhancement on the reliable static templates to provide modality interaction cues in the subsequent encoding layers for joint feature extraction. Simultaneously, we integrate the dynamic tokens from the previous time step into the current search region to provide information about target variations.
  • Figure 3: Conceptual illustration of the Spatio-Temporal Multimodal Tokens (STMT) module. For clarity, we present only the core design aspects and omit details such as template updates and operations like LN and MLP.
  • Figure 4: Extraction process of the multimodal dynamic tokens. The frame at the current time step (T) is passed through reshaping, ROI cropping, and a second reshaping to obtain dynamic tokens. These dynamic tokens are preserved for the next time step, and the dynamic tokens from the previous time step (T-1) are input to the network at the current time step.
  • Figure 5: Visual examples of tracking results on RGBT234. STMT is compared with the baseline on two sequences, where the blue boxes represent the results of the tracker and the green boxes indicate the ground truth.
  • ...and 1 more figures
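
The reshape, ROI crop, reshape sequence described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name extract_dynamic_tokens, the row-major token layout, and the box format (x0, y0, x1, y1) in token-grid coordinates with exclusive upper bounds are all assumptions, and no resampling of the crop to a fixed size is attempted.

```python
def extract_dynamic_tokens(tokens, grid_h, grid_w, box):
    """Crop the tokens that fall inside a box on the token grid.

    tokens: flat list of grid_h * grid_w token vectors in row-major
            order (layout is an assumption).
    box:    (x0, y0, x1, y1) in grid cells, upper bounds exclusive
            (hypothetical format).
    Returns the cropped tokens as a flat list, again in row-major order.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the grid size")
    x0, y0, x1, y1 = box
    # Clamp the box to the grid so an oversized box cannot index outside it.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(grid_w, x1), min(grid_h, y1)
    # Reshape: flat sequence -> 2D grid of tokens.
    grid = [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]
    # ROI crop on the grid.
    cropped = [row[x0:x1] for row in grid[y0:y1]]
    # Reshape back: 2D crop -> flat token sequence.
    return [tok for row in cropped for tok in row]


# A 4x4 grid whose tokens are labeled 0..15 for easy inspection.
tokens = [[i] for i in range(16)]
dynamic_tokens = extract_dynamic_tokens(tokens, 4, 4, (1, 1, 3, 3))
# dynamic_tokens holds the tokens labeled 5, 6, 9, 10
```

In a tracking loop under these assumptions, the tokens returned at time step T would be stored and passed to the network at T+1, as the caption describes.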