MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

Zhe Li, Haiwei Pan, Kejia Zhang, Yuhua Wang, Fengming Yu

TL;DR

This work tackles the inefficiency and limited performance of existing multi-modality image fusion (MMIF) methods by introducing MambaDFuse, a Mamba-based dual-phase model. It combines a dual-level feature extractor (CNN for local cues and Mamba for long-range dependencies) with a dual-phase fusion strategy (shallow channel exchange for global information and deep Multi-modal Mamba (M3) blocks for modality-guided detail fusion), followed by fused-image reconstruction. The approach achieves state-of-the-art results on infrared-visible and medical image fusion benchmarks and improves downstream object detection. By building on State Space Models within the hardware-aware Mamba framework, the method aims to be both efficient and effective, offering a scalable backbone for real-time MMIF tasks and broader vision applications.
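To make the efficiency argument concrete, the sketch below runs a plain discretized state space recurrence over a 1-D sequence in NumPy. It only illustrates why the cost grows linearly with sequence length. It is not the MambaDFuse code, and the function name `ssm_scan`, the diagonal state matrix, and the parameter values are assumptions chosen for the example. Mamba additionally makes the parameters input-dependent and uses a hardware-aware parallel scan, neither of which is shown here.

```python
import numpy as np

def ssm_scan(x, a_diag, b, c, delta):
    """Run a diagonal linear state space model over a 1-D sequence in one pass.

    x:      (L,) input sequence
    a_diag: (N,) diagonal of the continuous-time state matrix (negative = stable)
    b, c:   (N,) input and output projections
    delta:  scalar step size used for discretization
    """
    # Zero-order-hold discretization, specialised to a diagonal state matrix.
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / a_diag * b

    h = np.zeros_like(a_diag)            # hidden state, size N
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):          # one O(N) update per step: O(L * N) total
        h = a_bar * h + b_bar * x[t]
        y[t] = c @ h
    return y

# Hypothetical toy values, chosen only to exercise the function.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
a_diag = -np.arange(1, 5, dtype=float)
b = np.ones(4)
c = np.ones(4) / 4
y = ssm_scan(x, a_diag, b, c, delta=0.1)
```

Each step touches only the fixed-size state `h`, so doubling the sequence length doubles the work, whereas full self-attention compares every position with every other position.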

Abstract

Multi-modality image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image that represents the imaging scene comprehensively and facilitates downstream visual tasks. In recent years, significant progress has been made in MMIF tasks due to advances in deep neural networks. However, existing methods cannot extract modality-specific and modality-fused features both effectively and efficiently, because they are constrained by either the inherent local inductive bias of CNNs or the quadratic computational complexity of Transformers. To overcome this issue, we propose a Mamba-based Dual-phase Fusion (MambaDFuse) model. First, a dual-level feature extractor captures long-range features from single-modality images by extracting low-level and high-level features with CNN and Mamba blocks. Then, a dual-phase feature fusion module obtains fusion features that combine complementary information from different modalities. It uses the channel exchange method for shallow fusion and the enhanced Multi-modal Mamba (M3) blocks for deep fusion. Finally, the fused image reconstruction module applies the inverse transformation of the feature extraction to generate the fused result. In extensive experiments, our approach achieves promising fusion results in infrared-visible image fusion and medical image fusion. Additionally, in a unified benchmark, MambaDFuse also improves performance in downstream tasks such as object detection. Code with checkpoints will be available after the peer-review process.
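The shallow-fusion step can be pictured with a small NumPy sketch. The abstract does not spell out which channels are exchanged, so the rule used here (swap every other channel between the two modality feature maps) is an assumption for illustration, as are the function name `channel_exchange` and the `(C, H, W)` layout.

```python
import numpy as np

def channel_exchange(feat_a, feat_b, period=2):
    """Swap every `period`-th channel between two (C, H, W) feature maps.

    The swap rule is a hypothetical choice; the paper's exact rule may differ.
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("feature maps must have the same shape")
    # Boolean mask over the channel axis: True marks channels to exchange.
    mask = (np.arange(feat_a.shape[0]) % period) == 0
    out_a, out_b = feat_a.copy(), feat_b.copy()
    out_a[mask] = feat_b[mask]
    out_b[mask] = feat_a[mask]
    return out_a, out_b

# Stand-in features: zeros for infrared, ones for visible, so swaps are easy to see.
ir = np.zeros((4, 2, 2))
vis = np.ones((4, 2, 2))
ir_x, vis_x = channel_exchange(ir, vis)
```

After the call, channels 0 and 2 of `ir_x` hold visible-image features while channels 1 and 3 keep the infrared ones, and the reverse holds for `vis_x`. The exchange adds no parameters, so each branch sees the other modality before the deeper fusion blocks run.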

Paper Structure

This paper contains 15 sections, 6 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparisons of fusion, detection, and efficiency versus effectiveness with state-of-the-art methods on the MSRS and MRI-CT datasets. Octagons formed by lines of different colors represent the values of different methods across eight metrics. MambaDFuse achieves the best overall performance. The bubble chart compares efficiency and effectiveness, and the numbers inside the circles give the time required to fuse a pair of images. Methods with fusion performance similar to ours are slower, while methods slightly faster than ours have significantly lower fusion effectiveness. The fusion and detection results also showcase the fusion capabilities of MambaDFuse. (The fusion metrics used in the chart are computed after normalization. The horizontal axis of the bubble chart represents time, while the vertical axis represents the sum of metrics.)
  • Figure 2: The overall architecture of MambaDFuse. It consists of three stages: dual-level feature extraction, dual-phase feature fusion, and fused image reconstruction.
  • Figure 3: (a) Implementation details of the shallow fuse module. (b) Channel exchange process and Grad-CAM (Selvaraju et al., 2017) visualization results. The visualization is computed on the output of Conv in the shallow fuse module. The heatmap shows that each image has integrated information from the other modality after channel exchange, which contributes to the shallow fusion.
  • Figure 4: The implementation details of the deep fuse module.
  • Figure 5: Visual comparison for “00718N” in the MSRS IVF dataset.
  • ...and 4 more figures