
RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba

Andong Lu, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo

TL;DR

This work tackles robust RGB-T tracking by enabling all-layer cross-modal interactions with a scalable approach. It introduces AINet, which uses a Difference-based Fusion Mamba (DFM) for per-layer modality enhancement and an Order-dynamic Fusion Mamba (OFM) to realize efficient all-layer interactions via dynamic, input-aware scanning. By embedding these modules in a two-stream ViT backbone and applying progressive fusion, AINet achieves state-of-the-art results across four public datasets while maintaining favorable computational efficiency. Extensive ablations substantiate the contributions of DFM and OFM, the benefits of all-layer utilization, and the linear scaling of resources, underscoring the method’s practical impact for robust multimodal tracking. The work also points to pathways for future improvements, including adopting a Mamba-centric backbone and model distillation to further enhance efficiency and applicability.

Abstract

Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion at each layer, but cannot execute feature interactions among all layers, which play a critical role in robust multimodal representation, because of the large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions across all modalities and layers in a progressive fusion Mamba for robust RGBT tracking. Although modality features in different layers are known to contain different cues, building multimodal interactions in each layer remains challenging because interaction capability must be balanced against efficiency. Considering that the feature discrepancy between the RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a very long token sequence (3840 tokens in this work) is involved, and the computational burden is therefore large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to execute efficient and effective feature interactions across all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods.
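
To make the 3840-token figure and the motivation for linear complexity concrete, the following is a minimal arithmetic sketch. The backbone settings (12 layers, 16x16 patches, a 128x128 template, and a 256x256 search region) are assumptions chosen because they reproduce the stated count; they are not given in this section. The cost functions are rough operation counts for illustration, not measurements.

```python
# Assumed settings (not stated in this section): a 12-layer ViT with 16x16
# patches, a 128x128 template image, and a 256x256 search image.
PATCH = 16
TEMPLATE_SIZE = 128
SEARCH_SIZE = 256
NUM_LAYERS = 12


def tokens_per_image(size, patch=PATCH):
    """Number of patch tokens for a square image of the given side length."""
    side = size // patch
    return side * side


def tokens_per_layer():
    """Template tokens plus search tokens produced at one layer."""
    return tokens_per_image(TEMPLATE_SIZE) + tokens_per_image(SEARCH_SIZE)


def all_layer_tokens(num_layers=NUM_LAYERS):
    """Length of the sequence obtained by concatenating every layer's tokens."""
    return num_layers * tokens_per_layer()


def attention_pairs(n):
    """Self-attention compares every token with every token: O(n^2)."""
    return n * n


def scan_steps(n):
    """A recurrent (state-space style) scan visits each token once: O(n)."""
    return n


if __name__ == "__main__":
    n = all_layer_tokens()
    print("tokens per layer:", tokens_per_layer())
    print("all-layer tokens:", n)
    print("pairwise attention comparisons:", attention_pairs(n))
    print("linear scan steps:", scan_steps(n))
```

Under these assumptions, each layer contributes 64 + 256 = 320 tokens, and 12 layers give 3840. The quadratic count grows with the square of that length, which is the burden that a linear-complexity scan avoids.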

Paper Structure

This paper contains 15 sections, 11 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Comparison with existing RGBT tracking methods. (a) Interactions between specific layers, with joint fine-tuning of the entire backbone. (b) Interactions between all corresponding layers, with the pre-trained backbone being frozen. (c) Interactions between all corresponding layers, and interactions among all layers, with joint fine-tuning with the backbone. (d) Performance comparison on LasHeR, and comparison of additional parameters and GFLOPs.
  • Figure 2: The overall architecture of our proposed AINet. Firstly, RGB and TIR images are embedded as tokens and fed into Transformer blocks for joint feature extraction and relationship modeling between search and template images. Following each block, the tokens from both modalities are processed by the DFM for difference information enhancement and then returned to the backbone. Meanwhile, the fusion features at each layer are cascaded and fed into the OFM for all-layer interaction. Finally, the output features from the OFM are sent to the tracking head for target localization.
  • Figure 3: Visualization of fusion features with different numbers of layers applied. Here, “n” in AINet-“n” represents the number of layers applied.
  • Figure 4: Precision Rate (PR) of challenge attributes on LasHeR. The axes of each attribute have been normalized.
  • Figure 5: Comparison of GPU memory usage between our proposed framework and a transformer-based approach under variations in layer count and resolution. The blue line indicates resolution variation, and the green line indicates variation in layer count.
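
The data flow described in the Figure 2 caption (per-layer difference enhancement, features returned to the backbone, then one pass over the fusion features of all layers) can be sketched with toy stand-ins. This is an illustrative sketch only, not the authors' DFM or OFM: tokens are single floats, `backbone_block` is a placeholder for a Transformer block, `linear_scan` is a simple linear recurrence standing in for a Mamba block, and the layer-ordering score (mean absolute activation) is a hypothetical choice. All function names and parameters are assumptions.

```python
def backbone_block(tokens, scale=0.9):
    """Placeholder for a Transformer block (hypothetical: a plain rescaling)."""
    return [scale * x for x in tokens]


def difference_enhance(rgb, tir, alpha=0.25):
    """Toy stand-in for difference-based fusion.

    Each modality receives a fraction of the cross-modal difference, and the
    two enhanced streams are averaged into one fusion feature.
    """
    diff = [r - t for r, t in zip(rgb, tir)]
    rgb_out = [r - alpha * d for r, d in zip(rgb, diff)]  # RGB moves toward TIR
    tir_out = [t + alpha * d for t, d in zip(tir, diff)]  # TIR moves toward RGB
    fused = [(r + t) / 2 for r, t in zip(rgb_out, tir_out)]
    return rgb_out, tir_out, fused


def linear_scan(tokens, decay=0.5):
    """Linear-time recurrence h = decay * h + x, standing in for a Mamba scan."""
    h = 0.0
    out = []
    for x in tokens:
        h = decay * h + x
        out.append(h)
    return out


def order_dynamic_scan(layer_features):
    """Order layers by an input-dependent score, then scan the joined sequence.

    The score (mean absolute value, ascending) is a hypothetical choice.
    """
    def score(i):
        feats = layer_features[i]
        return sum(abs(v) for v in feats) / len(feats)

    order = sorted(range(len(layer_features)), key=score)
    sequence = [v for i in order for v in layer_features[i]]
    return order, linear_scan(sequence)


def forward(rgb, tir, num_layers=3):
    """Progressive fusion: enhance after every block, then interact across layers."""
    fused_per_layer = []
    for _ in range(num_layers):
        rgb, tir = backbone_block(rgb), backbone_block(tir)
        rgb, tir, fused = difference_enhance(rgb, tir)  # returned to the backbone
        fused_per_layer.append(fused)
    return order_dynamic_scan(fused_per_layer)


if __name__ == "__main__":
    order, features = forward([1.0, 2.0, 3.0, 4.0], [0.5, 2.5, 2.0, 5.0])
    print("scan order of layers:", order)
    print("number of output tokens:", len(features))
```

The point of the sketch is the shape of the computation: the per-layer step touches each token once, and the all-layer step processes a sequence whose length is the layer count times the per-layer token count in a single linear pass, with the layer order chosen from the input rather than fixed.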