Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion
Yixin Zhu, Long Lv, Pingping Zhang, Xuehu Liu, Tongdan Tang, Feng Tian, Weibing Sun, Huchuan Lu
TL;DR
This work tackles multimodal image fusion by introducing ISFM, a framework that enables interaction between spatial and frequency-domain information using a Mamba-based architecture. It combines a Modality-Specific Extractor, a Multi-scale Frequency Fusion that decomposes features into low- and high-frequency bands via Discrete Wavelet Transform, and an Interactive Spatial-Frequency Fusion with Frequency-Guided Mamba (FGM) and Frequency-Guided Gate (FGG) to produce the fused Y-channel $I_f^Y$. The training objective aggregates content, intensity, gradient, and structural similarity losses into a total objective $\nabla = \nabla_{cont} + \boldsymbol{\beta} \nabla_{int} + \boldsymbol{\theta} \nabla_{grad} + \nabla_{ssim}$, promoting both information preservation and texture fidelity. Experiments on six MMIF datasets demonstrate state-of-the-art or strongly competitive performance, with clear qualitative and quantitative gains and positive downstream task transfer; the authors provide the code at the linked GitHub repository.
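To make the low- and high-frequency decomposition mentioned above concrete, the following is a minimal NumPy sketch of a multi-level 2D Discrete Wavelet Transform. It assumes the Haar wavelet and a single-channel input whose height and width are divisible by 2 at every level. The wavelet family, the number of scales, and the function names (`haar_dwt2`, `haar_idwt2`, `multiscale_dwt`) are illustrative assumptions and are not taken from the ISFM code.

```python
import numpy as np


def haar_dwt2(x):
    """One level of a 2D Haar DWT (assumed wavelet; even height and width).

    Returns the low-frequency band and a tuple of three high-frequency bands.
    """
    x = np.asarray(x, dtype=float)
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # scaled local average
    row_diff = (a + b - c - d) / 2.0  # change between the two rows
    col_diff = (a - b + c - d) / 2.0  # change between the two columns
    diag_diff = (a - b - c + d) / 2.0 # diagonal change
    return low, (row_diff, col_diff, diag_diff)


def haar_idwt2(low, highs):
    """Inverse of haar_dwt2: rebuilds the full-resolution array."""
    row_diff, col_diff, diag_diff = highs
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (low + row_diff + col_diff + diag_diff) / 2.0
    x[0::2, 1::2] = (low + row_diff - col_diff - diag_diff) / 2.0
    x[1::2, 0::2] = (low - row_diff + col_diff - diag_diff) / 2.0
    x[1::2, 1::2] = (low - row_diff - col_diff + diag_diff) / 2.0
    return x


def multiscale_dwt(x, levels):
    """Repeatedly decompose the low-frequency band.

    Returns the coarsest low-frequency band and a list of high-frequency
    band tuples, ordered from the finest scale to the coarsest.
    """
    low = np.asarray(x, dtype=float)
    pyramid = []
    for _ in range(levels):
        low, highs = haar_dwt2(low)
        pyramid.append(highs)
    return low, pyramid


if __name__ == "__main__":
    image = np.arange(64, dtype=float).reshape(8, 8)
    coarse, details = multiscale_dwt(image, levels=2)
    print(coarse.shape, [bands[0].shape for bands in details])
```

In this sketch the low-frequency band carries smooth intensity content, while the three difference bands carry edges and texture. The transform is exactly invertible, so no information is lost when features are split into bands before fusion.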
Abstract
Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods have incorporated frequency-domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. The MSE models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM achieves better performance than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.
