Cross-modulated Attention Transformer for RGBT Tracking

Yun Xiao, Jiacong Zhao, Andong Lu, Chenglong Li, Yin Lin, Bing Yin, Cong Liu

TL;DR

This work tackles the problem of inconsistent and potentially misleading correlation weights in Transformer-based RGBT tracking by proposing Cross-modulated Attention Transformer (CAFormer). CAFormer unifies intra- and inter-modality feature interactions through a Correlation Modulated Enhancement (CME) module and introduces a bidirectional Cross-Modulated Attention (CMA) mechanism to adapt correlation weights across modalities, complemented by a Collaborative Token Elimination (CTE) strategy to boost efficiency. Empirical results on five public datasets show state-of-the-art performance with high inference speed (up to 83.6 FPS), and ablation studies validate the contribution of CMA and CTE to tracking accuracy and efficiency. The approach offers a novel fusion paradigm for multi-modal tracking that emphasizes correlation consistency over traditional feature fusion, with potential for further gains by combining correlation and feature fusion in future work.

Abstract

Existing Transformer-based RGBT trackers achieve strong performance by using self-attention to extract uni-modal features and cross-attention to enhance multi-modal feature interaction and template-search correlation computation. However, computing the search-template correlation independently in each branch ignores the consistency between branches, which can produce ambiguous and inappropriate correlation weights. This not only limits the intra-modal feature representation, but also harms the robustness of cross-attention for multi-modal feature interaction and search-template correlation computation. To address these issues, we propose a novel approach for RGBT tracking called Cross-modulated Attention Transformer (CAFormer), which performs intra-modality self-correlation, inter-modality feature interaction, and search-template correlation computation in a unified attention model. In particular, we first independently generate correlation maps for each modality and feed them into the designed Correlation Modulated Enhancement module, which modulates inaccurate correlation weights by seeking the consensus between modalities. This design unifies the self-attention and cross-attention schemes, which not only alleviates inaccurate attention weight computation in self-attention but also eliminates the redundant computation introduced by an extra cross-attention scheme. In addition, we propose a collaborative token elimination strategy to further improve tracking inference efficiency and accuracy. Extensive experiments on five public RGBT tracking benchmarks show the outstanding performance of the proposed CAFormer against state-of-the-art methods.
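The pipeline described in the abstract (independent per-modality correlation maps, then cross-modal modulation, then attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the residual linear form used for the modulation step, the function names, and the parameter dictionaries `params` and `w_mod` are all assumptions made for illustration, standing in for the Correlation Modulated Enhancement module whose exact design is not given here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def correlation_map(tokens, w_q, w_k):
    """Scaled dot-product correlation (pre-softmax logits), shape (N, N)."""
    q = tokens @ w_q
    k = tokens @ w_k
    return q @ k.T / np.sqrt(q.shape[-1])


def cross_modulated_attention(rgb, tir, params, w_mod):
    """Sketch of one cross-modulated attention step.

    rgb, tir: (N, D) token matrices (template and search tokens concatenated).
    params:   hypothetical dict of (D, D) projections for q/k/v per modality.
    w_mod:    hypothetical dict of (N, N) matrices; an assumed stand-in
              for the modulation module, not the published design.
    """
    # 1. Correlation maps are computed independently for each modality.
    corr_rgb = correlation_map(rgb, params["rgb_q"], params["rgb_k"])
    corr_tir = correlation_map(tir, params["tir_q"], params["tir_k"])

    # 2. Bidirectional modulation: each map receives a residual correction
    #    derived from the other modality's map (assumed linear form).
    mod_rgb = corr_rgb + corr_tir @ w_mod["tir_to_rgb"]
    mod_tir = corr_tir + corr_rgb @ w_mod["rgb_to_tir"]

    # 3. The modulated maps become attention weights over each modality's values.
    attn_rgb = softmax(mod_rgb)
    attn_tir = softmax(mod_tir)
    out_rgb = attn_rgb @ (rgb @ params["rgb_v"])
    out_tir = attn_tir @ (tir @ params["tir_v"])
    return out_rgb, out_tir, attn_rgb, attn_tir
```

In this sketch no separate cross-attention layer is needed: the only cross-modal exchange happens on the correlation maps, so setting both `w_mod` matrices to zero reduces each branch to ordinary self-attention.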

Paper Structure (19 sections, 9 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison of performance and speed for state-of-the-art tracking methods on RGBT234. We plot the Success Rate (SR) against Frames Per Second (FPS). Closer to the top means higher performance, and closer to the right means faster. CAFormer ranks first in SR while running at 83.6 FPS.
  • Figure 2: Illustration of correlation maps with different fusion methods under different modal quality inputs. The score map is the output of the location branch in the tracking head.
  • Figure 3: Overall framework of Cross-modulated Attention Transformer (CAFormer) for RGBT tracking.
  • Figure 4: The proposed Cross-modulated Attention with the Correlation Modulated Enhancement (CME) module. denotes dividing the features of two modalities, $\bigotimes$ denotes matrix multiplication, and $\bigoplus$ denotes element-wise addition. The numbers beside the arrows are feature dimensions, excluding the batch size. Linear projections in (a) and matrix transpose operations are omitted for brevity. $N$, $N_x$, and $N_z$ denote the total number of tokens, the number of search-region tokens, and the number of template-region tokens, respectively.
  • Figure 5: Attribute-based evaluation on RGBT234 dataset. In parentheses, the value on the left indicates the minimum success rate, and on the right the maximum success rate.
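
Using the token notation from the Figure 4 caption ($N = N_z + N_x$), the collaborative token elimination strategy mentioned in the abstract might be sketched as follows. The scoring rule here is an assumption: it averages each modality's template-to-search attention, sums the two modalities, and keeps the same top-scoring search tokens in both branches. The function names and the `keep_ratio` parameter are hypothetical, and the token order is assumed to be template first, then search.

```python
import numpy as np


def collaborative_token_elimination(attn_rgb, attn_tir, n_z, keep_ratio=0.7):
    """Pick search tokens to keep, jointly for both modalities.

    attn_rgb, attn_tir: (N, N) attention maps, tokens ordered [template; search].
    n_z:                number of template tokens (N_z).
    Returns sorted indices into the search tokens (values in 0..N_x-1).
    """
    # Template-to-search block: rows are template tokens, columns are search tokens.
    score_rgb = attn_rgb[:n_z, n_z:].mean(axis=0)
    score_tir = attn_tir[:n_z, n_z:].mean(axis=0)

    # Assumed collaboration rule: sum the per-modality scores so that one
    # shared decision is made for both branches.
    joint = score_rgb + score_tir

    n_x = joint.shape[0]
    n_keep = max(1, int(np.ceil(keep_ratio * n_x)))
    top = np.argsort(-joint, kind="stable")[:n_keep]
    return np.sort(top)  # preserve the original spatial order


def prune(tokens, keep, n_z):
    """Drop eliminated search tokens; template tokens are always kept."""
    return np.concatenate([tokens[:n_z], tokens[n_z + keep]], axis=0)
```

Applying the same `keep` indices to the RGB and thermal token matrices keeps the two branches aligned, so later correlation maps retain matching shapes while the token count drops from $N_z + N_x$ to $N_z$ plus the number of kept tokens.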