Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion

Yixin Zhu, Long Lv, Pingping Zhang, Xuehu Liu, Tongdan Tang, Feng Tian, Weibing Sun, Huchuan Lu

TL;DR

This work tackles multimodal image fusion by introducing ISFM, a framework that enables interaction between spatial and frequency-domain information using a Mamba-based architecture. It combines a Modality-Specific Extractor, a Multi-scale Frequency Fusion that decomposes features into low- and high-frequency bands via Discrete Wavelet Transform, and an Interactive Spatial-Frequency Fusion with Frequency-Guided Mamba (FGM) and Frequency-Guided Gate (FGG) to produce the fused Y-channel $I_f^Y$. The training objective aggregates content, intensity, gradient, and structural similarity losses into a total objective $\nabla = \nabla_{cont} + \boldsymbol{\beta} \nabla_{int} + \boldsymbol{\theta} \nabla_{grad} + \nabla_{ssim}$, promoting both information preservation and texture fidelity. Experiments on six MMIF datasets demonstrate state-of-the-art or strongly competitive performance, with clear qualitative and quantitative gains and positive downstream task transfer; the authors provide the code at the linked GitHub repository.
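
To make the low-/high-frequency split concrete, the sketch below applies one level of a 2D Haar wavelet transform to a small image stored as nested lists. It is only an illustration under stated assumptions: the summary names the Discrete Wavelet Transform but not the wavelet family, so the Haar filter, the function name `haar_dwt2`, and the sub-band names are hypothetical choices, not taken from the ISFM code.

```python
def haar_dwt2(image):
    """One-level 2D orthonormal Haar DWT (illustrative; wavelet choice is an assumption).

    `image` is a list of equal-length rows with even height and width.
    Returns four half-resolution sub-bands:
      low    - local averages (low-frequency content)
      high_h - differences between left and right columns
      high_v - differences between top and bottom rows
      high_d - diagonal differences
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")

    low, high_h, high_v, high_d = [], [], [], []
    for r in range(0, rows, 2):
        low_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # Each 2x2 block [[a, b], [d, e]] yields one coefficient per sub-band.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + d + e) / 2)
            h_row.append((a - b + d - e) / 2)
            v_row.append((a + b - d - e) / 2)
            d_row.append((a - b - d + e) / 2)
        low.append(low_row)
        high_h.append(h_row)
        high_v.append(v_row)
        high_d.append(d_row)
    return low, high_h, high_v, high_d


# A flat patch has no detail, so every high-frequency sub-band is zero.
flat_low, flat_h, flat_v, flat_d = haar_dwt2([[1, 1], [1, 1]])
print(flat_low, flat_h, flat_v, flat_d)  # [[2.0]] [[0.0]] [[0.0]] [[0.0]]
```

Applying the function again to `low` gives the next, coarser scale, which is one plausible reading of "multi-scale" here. A fusion module would then combine the low bands and the high bands of the two modalities separately before inverting the transform.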

Abstract

Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.
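
The linear-complexity claim for the extractor can be illustrated with the recurrence that state-space models build on. The sketch below is a deliberately simplified, scalar, time-invariant scan with hypothetical parameters `a`, `b`, and `c`. It is not the selective scan used in Mamba, where the parameters depend on the input, and it is not drawn from the ISFM code.

```python
def ssm_scan(inputs, a, b, c):
    """Scalar discrete state-space recurrence (illustrative simplification).

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t,   with h_0 = 0.
    Every step does constant work, so a sequence of length N costs O(N),
    yet each output still depends on all earlier inputs through the state.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x
        outputs.append(c * state)
    return outputs


# An impulse at the first position decays geometrically but never vanishes:
# later positions still "see" it, which is the long-range dependency.
print(ssm_scan([1, 0, 0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]
```

For images, a vision state-space module typically flattens the feature map into one or more scan orders and runs such a recurrence along each. Its cost therefore grows linearly with the number of pixels, whereas self-attention grows quadratically.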

Paper Structure (33 sections, 22 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: The paradigm and performance comparison of our proposed ISFM and existing MMIF methods. (a) Sequential spatial-frequency fusion methods; (b) Parallel spatial-frequency fusion methods; (c) Our proposed ISFM; (d) Performance comparison on MSRS [tang2022piafusion] and FMB [liu2023multi] in seven metrics.
  • Figure 2: Overview of our proposed ISFM framework. (a) Modality-Specific Extractor (MSE) extracts modality-specific features; (b) Multi-scale Frequency Fusion (MFF) performs frequency-domain fusion at different scales; (c) Interactive Spatial-Frequency Fusion (ISF) incorporates frequency information into the spatial fusion; (d) Structure of the Vision State-Space Module (VSSM).
  • Figure 3: Illustration of the proposed MFF.
  • Figure 4: Illustration of the proposed LFFB and HFFB.
  • Figure 5: The architecture of the proposed ISF. (a) Frequency-Guided Mamba (FGM) leverages Mamba to enhance complementary representations; (b) Frequency-Guided Gate (FGG) incorporates frequency features to guide spatial features across modalities.
  • ...and 12 more figures