Temporal Adaptive RGBT Tracking with Modality Prompt

Hongyu Wang, Xiaotao Liu, Yifan Li, Meng Sun, Dian Yuan, Jing Liu

TL;DR

TATrack tackles RGB-T tracking by explicitly leveraging temporal information through an online-updated template and a spatio-temporal interaction (STI) mechanism that bridges the initial and online templates. It integrates modality prompts within a ViT-based two-stream architecture to fuse RGB and thermal infrared (TIR) data more effectively, enabling cross-modal interaction over longer time scales. The approach achieves state-of-the-art results on three RGB-T benchmarks while running at real-time speed, and ablations confirm the importance of the online template, STI, and modality prompts. The work advances multimodal tracking by unifying spatial fusion, temporal adaptation, and cross-modal communication in a single framework.

Abstract

RGBT tracking has been widely used in various fields such as robotics, surveillance processing, and autonomous driving. Existing RGBT trackers fully explore the spatial information between the template and the search region and locate the target based on the appearance matching results. However, these RGBT trackers make very limited use of temporal information: they either ignore it or exploit it through online sampling and training. The former struggle to cope with changes in the object state, while the latter neglect the correlation between spatial and temporal information. To alleviate these limitations, we propose a novel Temporal Adaptive RGBT Tracking framework, named TATrack. TATrack has a spatio-temporal two-stream structure and captures temporal information through an online updated template, where the two streams perform multi-modal feature extraction and cross-modal interaction for the initial template and the online updated template, respectively. This design lets TATrack comprehensively exploit spatio-temporal and multi-modal information for target localization. In addition, we design a spatio-temporal interaction (STI) mechanism that bridges the two branches and enables cross-modal interaction to span longer time scales. Extensive experiments on three popular RGBT tracking benchmarks show that our method achieves state-of-the-art performance while running at real-time speed.

Paper Structure (12 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Differences between our RGBT tracking approach and previous ones. (a) Sampling from tracking result and training online. (b) Processing RGB and TIR images separately and performing cross-modal interaction through specialized networks. (c) Performing feature extraction and cross-modal interaction simultaneously and capturing temporal information by online updated templates.
  • Figure 2: The overall framework of TATrack. The triplet of the search area, the initial template, and the online template is first embedded into tokens by the patch embedding layer. The initial branch and the online branch use the initial template and the online template, respectively, for feature extraction and cross-modal interaction. The prompter generates modality prompts and adjusts the inputs to the transformer encoder. STI enables the cross-frame propagation of spatio-temporal and multi-modal information.
  • Figure 3: The processing in STI and MCP. TATrack gets a robust and precise representation of the target object by combining spatio-temporal information with multi-modal information.
  • Figure 4: Visualization of response maps. The first row shows the RGB search region with a green bounding box. The second row shows the TIR search region. The third row shows the response map of the search region.
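
The tokenization step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the generic ViT-style patch embedding of a search area, an initial template, and an online template, and how each branch might receive its own template tokens together with the search tokens. The function names, image sizes, patch size, embedding dimension, and the "template tokens followed by search tokens" layout are all assumptions made for illustration. Only one modality is shown; the RGB and TIR images would presumably each be tokenized the same way.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def make_projection(in_dim, out_dim, seed=0):
    """Random linear projection standing in for a learned patch-embedding weight."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def embed(patches, weight):
    """Project each flattened patch to one token of length len(weight[0])."""
    out_dim = len(weight[0])
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) for j in range(out_dim)]
        for p in patches
    ]


def random_image(size, channels, rng):
    """Square placeholder image filled with random values."""
    return [[[rng.random() for _ in range(channels)] for _ in range(size)]
            for _ in range(size)]


# Hypothetical toy sizes, chosen only to keep the example fast.
PATCH, DIM, CHANNELS = 4, 8, 3
TEMPLATE_SIZE, SEARCH_SIZE = 8, 16

rng = random.Random(0)
weight = make_projection(PATCH * PATCH * CHANNELS, DIM)

search = random_image(SEARCH_SIZE, CHANNELS, rng)
initial_template = random_image(TEMPLATE_SIZE, CHANNELS, rng)
online_template = random_image(TEMPLATE_SIZE, CHANNELS, rng)

search_tokens = embed(patchify(search, PATCH), weight)
initial_tokens = embed(patchify(initial_template, PATCH), weight)
online_tokens = embed(patchify(online_template, PATCH), weight)

# Assumed layout: each branch sees its own template tokens followed by the search tokens.
initial_branch_input = initial_tokens + search_tokens
online_branch_input = online_tokens + search_tokens

print(len(initial_tokens), len(search_tokens), len(initial_branch_input))
```

With these toy sizes, each template yields 4 tokens and the search area yields 16, so each branch receives a sequence of 20 tokens. The prompter and STI stages are not sketched here, because the text above does not specify their internals.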