
Cross Fusion RGB-T Tracking with Bi-directional Adapter

Zhirong Zeng, Xiaotao Liu, Meng Sun, Hongyu Wang, Jing Liu

TL;DR

A novel Cross Fusion RGB-T Tracking architecture (CFBT) that ensures the full participation of multiple modalities in tracking while dynamically fusing temporal information, and achieves new state-of-the-art performance.

Abstract

Many state-of-the-art RGB-T trackers have achieved remarkable results through modality fusion. However, these trackers often either overlook temporal information or fail to fully utilize it, resulting in a poor balance between multi-modal and temporal information. To address this issue, we propose a novel Cross Fusion RGB-T Tracking architecture (CFBT) that ensures the full participation of multiple modalities in tracking while dynamically fusing temporal information. The effectiveness of CFBT relies on three newly designed cross spatio-temporal information fusion modules: Cross Spatio-Temporal Augmentation Fusion (CSTAF), Cross Spatio-Temporal Complementarity Fusion (CSTCF), and Dual-Stream Spatio-Temporal Adapter (DSTA). CSTAF employs a cross-attention mechanism to comprehensively enhance the feature representation of the template. CSTCF utilizes complementary information between different branches to enhance target features and suppress background features. DSTA adopts the adapter concept to adaptively fuse complementary information from multiple branches within the transformer layer, using the RGB modality as a medium. These fusion modules account for less than 0.3% of the total model parameters, yet they enable an efficient balance between multi-modal and temporal information. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.

Paper Structure (20 sections, 10 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Differences between our RGB-T tracking approach and previous ones. (a) Utilizing the TIR modality to assist the RGB modality. (b) Facilitating interaction between the TIR and RGB modalities. (c) Enhancing the templates of the RGB modality in different branches through interaction. (d) Using the RGB modality as a medium to enhance template interaction, complement the search regions through interaction, and transmit temporal information for deep cross-modal interaction.
  • Figure 2: The overall architecture of our proposed CFBT. First, we embed the image patches and then concatenate the initial template, online template, and search regions. The concatenated inputs are passed through an $N$-layer transformer encoder. BA modules are inserted into each transformer encoder layer to enable cross-modal interaction. CSTAF and CSTCF modules are added at the 4th, 7th, and 10th layers to facilitate temporal information interaction, and DSTA modules are applied at the 5th, 6th, and 11th layers to further enhance deep temporal information interaction. Finally, the output features of the two branches are added and fed into the prediction head to produce the final tracking result.
  • Figure 3: The overall framework of the Cross Spatio-Temporal Fusion module, which consists of three main components: a down-projection linear layer, an up-projection linear layer, and a cross-attention layer.
  • Figure 4: The detailed architecture of DSTA and BA. DSTA includes three BA modules; we freeze the BA modules between modalities and train only the BA modules within the different branches. Each BA module consists of three linear layers: a down-projection linear layer, a linear projection layer, and an up-projection linear layer.
  • Figure 5: Further comparisons of CFBT and the competing methods under different attributes on the LasHeR dataset.
  • ...and 1 more figures
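
The module placement described in the Figure 2 caption can be summarized as a small per-layer schedule. The sketch below is illustrative only and is not taken from the authors' code: the encoder depth `NUM_LAYERS = 12` is an assumption (the caption only says $N$ layers, and the 11th layer is the deepest one mentioned), layer indices are assumed to be 1-based, and the names `modules_at` and `build_schedule` are hypothetical helpers.

```python
from typing import Dict, List

# ASSUMPTION: the caption only states an "N-layer" encoder; 12 is a guess
# that is merely consistent with the 11th layer being mentioned.
NUM_LAYERS = 12

# Layer indices are treated as 1-based to match "4th", "7th", ... in the caption.
CSTAF_CSTCF_LAYERS = {4, 7, 10}  # temporal information interaction
DSTA_LAYERS = {5, 6, 11}         # deeper temporal information interaction


def modules_at(layer: int) -> List[str]:
    """Return the fusion modules active at a 1-based encoder layer."""
    if not 1 <= layer <= NUM_LAYERS:
        raise ValueError(f"layer must be in 1..{NUM_LAYERS}, got {layer}")
    modules = ["BA"]  # the caption places BA modules in every encoder layer
    if layer in CSTAF_CSTCF_LAYERS:
        modules += ["CSTAF", "CSTCF"]
    if layer in DSTA_LAYERS:
        modules.append("DSTA")
    return modules


def build_schedule() -> Dict[int, List[str]]:
    """Map every encoder layer to the list of modules attached to it."""
    return {layer: modules_at(layer) for layer in range(1, NUM_LAYERS + 1)}


if __name__ == "__main__":
    for layer, modules in build_schedule().items():
        print(f"layer {layer:2d}: {', '.join(modules)}")
```

Under these assumptions, most layers carry only the BA modules, while the six listed layers add either the CSTAF/CSTCF pair or a DSTA module on top.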