FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba

Xinyu Xie, Yawen Cui, Tao Tan, Xubin Zheng, Zitong Yu

TL;DR

FusionMamba introduces a dynamic feature enhancement framework built on the Mamba state-space backbone to address the limitations of CNNs and ViTs in multimodal image fusion. It combines a Dynamic Vision State Space (DVSS) module for global context with a Dynamic Feature Fusion Module (DFFM) that houses the Dynamic Feature Enhancement Module (DFEM) and Cross-Modal Fusion Mamba Module (CMFM) to strengthen intra- and inter-modal interactions. The approach yields state-of-the-art performance across IR-VIS, multimodal medical, and biomedical fusion tasks while maintaining computational efficiency via ES2D-based long-range modeling and linear-scaling SSMs. Ablation and downstream-task experiments confirm the effectiveness of each module and loss term, highlighting FusionMamba's practical potential for real-time, high-fidelity fusion and its applicability to object detection and brain tumor segmentation.

Abstract

Multimodal image fusion aims to integrate information from different imaging techniques to produce a comprehensive, detail-rich single image for downstream vision tasks. Existing methods based on local convolutional neural networks (CNNs) struggle to capture global features efficiently, while Transformer-based models are computationally expensive, although they excel at global modeling. Mamba addresses these limitations by leveraging selective structured state space models (S4) to effectively handle long-range dependencies while maintaining linear complexity. In this paper, we propose FusionMamba, a novel dynamic feature enhancement framework that aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks. The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms, which not only retains its powerful global feature modeling capability, but also greatly reduces redundancy and enhances the expressiveness of local features. In addition, we have developed a new module called the dynamic feature fusion module (DFFM). It combines the dynamic feature enhancement module (DFEM) for texture enhancement and disparity perception with the cross-modal fusion Mamba module (CMFM), which focuses on enhancing the inter-modal correlation while suppressing redundant information. Experiments show that FusionMamba achieves state-of-the-art performance in a variety of multimodal image fusion tasks as well as downstream experiments, demonstrating its broad applicability and superiority.
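
The linear-complexity claim for state space models can be illustrated with a minimal sketch. It assumes a diagonal, time-invariant discrete recurrence $h_t = a \odot h_{t-1} + b\,u_t$, $y_t = c^\top h_t$; the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and not taken from the paper. Mamba additionally makes these parameters depend on the input (the "selective" part), which this sketch omits.

```python
def ssm_scan(u, a, b, c):
    """Run a diagonal discrete state space recurrence over a 1D sequence.

    h[t] = a * h[t-1] + b * u[t]   (element-wise over the state)
    y[t] = sum(c * h[t])

    The sequence is visited once, so the cost grows linearly with len(u).
    Parameters are fixed here; a selective SSM would compute them from u[t].
    """
    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    y = []
    for u_t in u:
        # Update every state dimension from the previous state and the input.
        h = [a_i * h_i + b_i * u_t for a_i, h_i, b_i in zip(a, h, b)]
        # Read out a scalar output from the current state.
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


# Impulse input: the output traces how long the state "remembers" u[0].
impulse = [1.0, 0.0, 0.0, 0.0]
y = ssm_scan(impulse, a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0])
```

The state dimension with decay 0.9 retains the first input much longer than the one with decay 0.5, which is the mechanism by which such models carry long-range information without pairwise attention over all positions.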

Paper Structure (27 sections, 14 equations, 14 figures, 13 tables)


Figures (14)

  • Figure 1: Illustration of qualitative and quantitative results of multimodal image fusion. The first row shows the source image pairs; the second row shows a qualitative comparison between the classical U2Fusion and our FusionMamba.
  • Figure 2: Overview of the framework. The FusionMamba network receives two images of different modalities as inputs. These images undergo multi-layer feature extraction and dynamic feature enhancement fusion through the fusion module, yielding fused features that include difference and texture enhancement. Finally, the module reconstructs the fusion result. I1, I2: different source images; F: fused image; DFFM: dynamic feature fusion module; DVSS: dynamic vision state space; LDC: learnable descriptive convolution; ECA: efficient channel attention; LN: LayerNorm; ESSM: efficient state space module; Linear: linear function; DwConv: depthwise separable convolution; SiLU: SiLU activation function; ES2D: efficient 2D scanning.
  • Figure 3: Dynamic feature fusion module (DFFM). $\bm{D}_{1}^{n}$ and $\bm{D}_{2}^{n}$ are features; $\bm{F}_{1}^{n}$ and $\bm{F}_{2}^{n}$ are different modal features; $\bm{F}_{f}^{n}$ is a coarse-grained feature fusion. $\oplus$ is the element-wise addition operation.
  • Figure 4: Dynamic feature enhancement module ($\mathrm{DFEM}_{1}$). $\bm{T}_{1}^{n}$ and $\bm{T}_{2}^{n}$ are enhanced feature maps. These maps are then passed through a global average pooling ($\mathit{GAP}\left(\cdot\right)$) operation and a sigmoid function ($\delta$) to compute the difference weights between the output feature maps. $\otimes$ and $\oplus$ are the element-wise multiplication and addition operations.
  • Figure 5: Cross-modal fusion Mamba module (CMFM). Dw-Conv: depthwise convolution; ECA: efficient channel attention; ES2D: efficient 2D scanning. $\bm{C}_{1}^{n}$ and $\bm{C}_{2}^{n}$ are the feature maps obtained by applying depthwise convolution to the inputs $\bm{D}_{1}^{n}$ and $\bm{D}_{2}^{n}$, respectively, to extract finer spatial features. $\bm{H}_{1}^{n}$ and $\bm{H}_{2}^{n}$ are the hybrid features.
  • ...and 9 more figures
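
The Figure 4 caption mentions difference weights computed with global average pooling and a sigmoid, combined through element-wise multiplication and addition. The sketch below shows one way such a gate could look. The exact wiring is not given in the caption, so the function `difference_gate` and the choice to add the weighted difference back onto each input are assumptions made purely for illustration, not the authors' DFEM.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, written delta in the caption."""
    return 1.0 / (1.0 + np.exp(-x))


def difference_gate(t1, t2):
    """Hypothetical difference-weighted exchange between two feature maps.

    t1, t2: arrays of shape (C, H, W), standing in for the enhanced maps.
    For each direction, the difference is pooled per channel (GAP), squashed
    with a sigmoid into a channel weight, multiplied with the difference,
    and added to the receiving map.
    """
    diff_21 = t2 - t1  # what modality 2 has that modality 1 lacks
    diff_12 = t1 - t2  # and the reverse direction
    # Global average pooling over the spatial axes, kept broadcastable.
    w_21 = sigmoid(diff_21.mean(axis=(1, 2), keepdims=True))
    w_12 = sigmoid(diff_12.mean(axis=(1, 2), keepdims=True))
    d1 = t1 + w_21 * diff_21  # element-wise multiply, then add
    d2 = t2 + w_12 * diff_12
    return d1, d2


# Tiny example with 2 channels and a 2x2 spatial grid.
t1 = np.zeros((2, 2, 2))
t2 = np.ones((2, 2, 2))
d1, d2 = difference_gate(t1, t2)
```

When the two inputs are identical the differences vanish and each map passes through unchanged, so under this assumed wiring the gate only acts where the modalities disagree.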