Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

Chenguang Zhu, Shan Gao, Huafeng Chen, Guangqian Guo, Chaowei Wang, Yaoxing Wang, Chen Shu Lei, Quanjiang Fan

TL;DR

Tmamba is a dual-branch image fusion network that combines a linear Transformer with Mamba, giving it global modeling capability at linear complexity. It also introduces cross-modal interaction at the attention level to obtain cross-modal attention.

Abstract

Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fused images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNNs) or limited by quadratic computational complexity (Transformers), and so cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of a linear Transformer and Mamba, and has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information, respectively. A T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information, respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.
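
To make the "linear complexity" claim concrete, the sketch below shows one common form of linear attention: kernelized attention with an `elu(x) + 1` feature map. This is an assumption for illustration only; the abstract does not specify which linear Transformer formulation Tmamba uses, and the function names here are hypothetical. The sketch does not model the Mamba branch. The point is that reordering the matrix products avoids building the N×N attention matrix while giving the same output.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map (assumed here, not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Kernelized attention in O(N * d * d_v): never forms the N x N matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                 # (d, d_v) summary of all keys and values
    z = k.sum(axis=0)            # (d,) normalizer shared by every query
    return (q @ kv) / (q @ z)[:, None]

def quadratic_attention(q, k, v):
    """Same kernelized attention, computed the O(N^2) way for comparison."""
    q, k = feature_map(q), feature_map(k)
    a = q @ k.T                  # (N, N) explicit attention matrix
    return (a / a.sum(axis=1, keepdims=True)) @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 16, 4, 3         # toy sizes: N tokens, key dim, value dim
    q = rng.normal(size=(n, d))
    k = rng.normal(size=(n, d))
    v = rng.normal(size=(n, d_v))
    same = np.allclose(linear_attention(q, k, v), quadratic_attention(q, k, v))
    print("outputs match:", same)
```

Both functions compute the same result; only the order of multiplication differs, which is what lets the cost grow linearly with the number of tokens N rather than quadratically.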

Paper Structure (29 sections, 15 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) and (b) visualize the features extracted by the Transformer and Mamba branches. It can be seen that the feature patterns of the two branches are very different. (c) Comparison of EN and MI values of fused images by different methods. The fusion results obtained by our method are rich in information and well preserve the input image information.
  • Figure 2: (a) The structure of the Tmamba block. (b) The structure of the Cross-modality Interaction module. We use the DASSP block [36] to assign optimal weights to the attention of the two modalities.
  • Figure 3: Visual comparison for “18” in sandpath of TNO IVF dataset.
  • Figure 4: Visual comparison for “FLIR 06506” in RoadScene IVF dataset.
  • Figure 5: Visual comparison for “MRI-CT 14” in MRI-CT MIF dataset.
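
Figure 1(c) compares methods by EN and MI. As a minimal sketch, assuming EN denotes the Shannon entropy of the fused image's gray-level histogram and MI denotes histogram-based mutual information between a source image and the fused image (the usual meanings in the fusion literature; the paper's exact definitions are not given in this section), the two quantities could be computed as follows. The function names and the 8-bit intensity range are assumptions.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (bits) of the gray-level histogram of an 8-bit image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                           # skip empty bins: 0 * log(0) := 0
    return float(-(p * np.log2(p)).sum())

def mutual_information(a, b, bins=256):
    """Mutual information (bits) between two same-sized 8-bit images."""
    joint, _, _ = np.histogram2d(
        a.ravel(), b.ravel(), bins=bins, range=[[0, 256], [0, 256]]
    )
    pxy = joint / joint.sum()              # joint gray-level distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of b, shape (1, bins)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random arrays stand in for real infrared, visible, and fused images.
    ir = rng.integers(0, 256, size=(64, 64))
    vis = rng.integers(0, 256, size=(64, 64))
    fused = (ir + vis) // 2
    print("EN:", entropy(fused))
    # A fusion MI score is often reported as MI(ir, fused) + MI(vis, fused).
    print("MI:", mutual_information(ir, fused) + mutual_information(vis, fused))
```

Higher EN suggests the fused image carries more information, and higher MI suggests more of the source images' information is preserved, which matches how the caption interprets the comparison.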