FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model

Siran Peng, Xiangyu Zhu, Haoyu Deng, Liang-Jian Deng, Zhen Lei

TL;DR

Quantitative and qualitative evaluation results across six datasets demonstrate that the proposed FusionMamba method achieves state-of-the-art (SOTA) performance.

Abstract

Remote sensing image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral data and a low-resolution image rich in spectral information. Current deep learning (DL) methods typically employ convolutional neural networks (CNNs) or Transformers for feature extraction and information integration. While CNNs are efficient, their limited receptive fields restrict their ability to capture global context. Transformers excel at learning global information but are computationally expensive. Recent advancements in the state space model (SSM), particularly Mamba, present a promising alternative by enabling global perception with low complexity. However, the potential of SSM for information integration remains largely unexplored. Therefore, we propose FusionMamba, an innovative method for efficient remote sensing image fusion. Our contributions are twofold. First, to effectively merge spatial and spectral features, we expand the single-input Mamba block to accommodate dual inputs, creating the FusionMamba block, which serves as a plug-and-play solution for information integration. Second, we incorporate Mamba and FusionMamba blocks into an interpretable network architecture tailored for remote sensing image fusion. Our design uses two U-shaped network branches, each primarily composed of four-directional Mamba blocks, to extract spatial and spectral features separately and hierarchically. The resulting feature maps are thoroughly merged in an auxiliary network branch constructed with FusionMamba blocks. Furthermore, we improve the representation of spectral information through an enhanced channel attention module. Quantitative and qualitative evaluation results across six datasets demonstrate that our method achieves SOTA performance. The code is available at https://github.com/PSRben/FusionMamba.
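
To make the idea of "global perception with low complexity" concrete, the following is a minimal sketch, not the authors' implementation. It uses a scalar, time-invariant linear recurrence in place of Mamba's selective SSM, and it assumes that the four directions are row-major, column-major, and their reverses, with the four scan results merged by summation. The names `ssm_scan`, `four_direction_orders`, and `four_direction_scan` are hypothetical.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t (one pass, O(len(xs)))."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # state carries a decaying summary of all earlier inputs
        ys.append(c * h)
    return ys


def four_direction_orders(height, width):
    """Assumed scan orders: row-major, column-major, and the reverse of each."""
    row_major = [(r, col) for r in range(height) for col in range(width)]
    col_major = [(r, col) for col in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def four_direction_scan(image, a=0.5):
    """Scan a 2-D grid along four orders and sum the results per pixel (assumed merge)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for order in four_direction_orders(height, width):
        ys = ssm_scan([image[r][col] for r, col in order], a=a)
        for (r, col), y in zip(order, ys):
            out[r][col] += y
    return out


# An impulse shows the decaying influence of an earlier input on later outputs.
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]

# A single bright pixel reaches every output position after the four scans.
print(four_direction_scan([[1.0, 0.0], [0.0, 0.0]]))
```

Each scan visits every pixel once, so the cost grows linearly with the number of pixels, yet a pixel can influence every other position because at least one of the four orders visits it first.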

Paper Structure (52 sections, 13 equations, 10 figures, 9 tables, 2 algorithms)

Figures (10)

  • Figure 1: Different combinations of feature extraction methods and information integration approaches for remote sensing image fusion. The candidate feature extraction methods include the convolution (Conv) layer, self-attention (SA) module \cite{vaswani2017attention}, and four-directional Mamba (Mamba) block \cite{liu2024vmamba}. For information integration, the options comprise the concatenation (Concat) operation, cross-attention (CA) module, and the proposed FusionMamba (FMamba) block. For fairness, all combinations are designed with the same number of parameters. Quantitative evaluation results on 20 reduced-resolution samples from the WorldView-3 (WV3) dataset \cite{9844267} demonstrate the superior efficacy and efficiency of our method. For precise values, please refer to Table \ref{abl5}.
  • Figure 2: Comparison among the convolution layer in CNNs, the self-attention module in Transformers, and the SSM in bidirectional Mamba \cite{zhu2024vision}. (a) Suppose we have an image with a resolution of $H\times W$. (b) The convolution operation integrates pixels within a limited receptive field, resulting in a computational complexity of $O(HW)$. (c) The self-attention mechanism uniformly integrates all pixels, which leads to a significantly higher computational complexity of $O(H^2W^2)$. (d) The SSM integrates all pixels along specific directions, with those closer to the output pixel contributing more significantly to the final result. Additionally, its computational complexity is $O(HW)$.
  • Figure 3: The proposed network architecture. Our design comprises two U-shaped network branches dedicated to feature extraction, a combination branch for information integration, and an MCA module for spectral enhancement. Detailed structures of the Mamba and FusionMamba blocks are depicted in Fig. \ref{mamba}.
  • Figure 4: The schematic diagram of the bidirectional Mamba block (first from the left), the four-directional Mamba block (second from the left), and the proposed FusionMamba block (second from the right), along with an illustration depicting the four flattening directions (first from the right). FSSM stands for the fusion state space model. The specifics of the SSM and FSSM blocks are detailed in Algorithms \ref{ssmblock} and \ref{fssmblock}, respectively.
  • Figure 5: Comparison of FLOPs among the convolution layer, bidirectional (BD) Mamba block, four-directional (FD) Mamba block, FusionMamba block, and self/cross-attention module at various spatial resolutions. For clearer visualization, we set $D$, $C$, and $N$ to 0.5M, 256, and 64, respectively.
  • ...and 5 more figures
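
The complexity orders quoted in the Figure 2 caption can be illustrated with a rough operation count. This is a simplified sketch that only tracks how the cost scales with the spatial resolution $H\times W$; it ignores channel width, state size, and constant factors, so the numbers are not the FLOPs reported in Figure 5. The function names, the 3x3 kernel, and the four scan directions are assumptions made for illustration.

```python
def conv_ops(height, width, kernel=3):
    """Each output pixel mixes a fixed kernel x kernel neighbourhood: O(HW)."""
    return height * width * kernel * kernel


def self_attention_ops(height, width):
    """Every pixel attends to every other pixel: O(H^2 W^2)."""
    tokens = height * width
    return tokens * tokens


def ssm_scan_ops(height, width, directions=4):
    """One recurrent step per pixel per scan direction: O(HW)."""
    return directions * height * width


for side in (32, 64, 128):
    print(side,
          conv_ops(side, side),
          self_attention_ops(side, side),
          ssm_scan_ops(side, side))
```

Doubling the side length multiplies the convolution and SSM counts by 4 but the self-attention count by 16, which is the gap between $O(HW)$ and $O(H^2W^2)$ that the caption describes.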