A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion

Zihan Cao, Xiao Wu, Liang-Jian Deng, Yu Zhong

TL;DR

This work customizes and improves the vision Mamba network for the image fusion task. It proposes the local-enhanced vision Mamba block, dubbed LEVM, which improves the network's perception of local information and learns local and global spatial information simultaneously.

Abstract

In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods that explore better ways of fusing them while preserving their respective characteristics. Mamba, a state space model, emerged in the field of natural language processing, and many recent studies have attempted to extend it to vision tasks. However, because images differ in nature from causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, Mamba's sequence modeling captures only spatial information and cannot effectively capture the rich spectral information in images. Motivated by these challenges, we customize and improve the vision Mamba network for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed LEVM. The LEVM block improves the network's perception of local information and learns local and global spatial information simultaneously. Furthermore, we propose a state sharing technique to enhance spatial details and integrate spatial and spectral information. The overall network, called LE-Mamba, is a multi-scale structure based on vision Mamba. Extensive experiments show that the proposed method achieves state-of-the-art results on multispectral pansharpening and on multispectral and hyperspectral image fusion datasets, demonstrating its effectiveness. Code is available at https://github.com/294coder/Efficient-MIF.
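
To make the abstract's point about causal sequences and limited state capacity concrete, here is a minimal sketch of a generic discrete linear state space recurrence applied to a flattened image. It is not the authors' LEVM or LE-Mamba code. The function names (`ssm_scan`, `flatten_row_major`), the fixed parameters `a`, `b`, `c`, and the row-major scan order are assumptions chosen for illustration, and Mamba itself makes its parameters depend on the input, which this sketch omits.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear SSM over a 1-D sequence (illustrative only).

    h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t   (state update)
    y_t    = sum_i c[i] * h_t[i]              (readout)
    """
    h = [0.0] * len(a)  # fixed-size state, independent of sequence length
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


def flatten_row_major(image):
    """Turn a 2-D image (list of rows) into a 1-D pixel sequence."""
    return [pixel for row in image for pixel in row]


# Hypothetical 2x2 single-channel image and a one-dimensional state.
image = [[1.0, 2.0], [3.0, 4.0]]
seq = flatten_row_major(image)
ys = ssm_scan(seq, a=[0.5], b=[1.0], c=[1.0])
print(ys)
```

Two properties of this sketch mirror the limitations named in the abstract. The output at each position depends only on pixels that come earlier in the scan order, so a pixel never sees neighbours that are scanned after it. The state `h` also keeps the same size however large the image is, so everything scanned so far must be compressed into it.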

Paper Structure

This paper contains 21 sections, 12 equations, 5 figures, 7 tables, 2 algorithms.

Figures (5)

  • Figure 1: Memory consumption and FLOPs of different operators on the same U-Net architecture. Our LE-Mamba has linear memory consumption, compared with the quadratic consumption of self-attention (Attention) [dosovitskiy2021an], and lower FLOPs than the attention of Swin Transformer [swin].
  • Figure 2: Illustration of (a) the main concept of the proposed LEVM block, (b) the composition of the vision Mamba block, (c) the proposed LEVM block, and (d) the overall architecture of LE-Mamba. "SS2D" in (b) denotes the 2D selective scan of [liu2024vmamba].
  • Figure 3: Illustration of the input feature and the corresponding state-shared feature computed as in Eq. \ref{eq: ssm-state-sharing}. The adjacent flow helps to learn more semantic information, while the skip-connected flow retains more low-level details and helps fusion. More examples are shown in the supplementary material.
  • Figure 4: Error maps of LE-Mamba and previous SOTA methods on the WV3 test set. LE-Mamba shows smaller errors with respect to the ground truth (GT). Some close-ups are shown in the corner.
  • Figure 5: Error maps of LE-Mamba and previous SOTA methods on the Harvard ($\times 8$) test set. LE-Mamba shows smaller errors with respect to the ground truth (GT). Some close-ups are shown in the corner.
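
The linear-versus-quadratic memory contrast in the Figure 1 caption can be illustrated with a rough element count. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a measurement from the paper. It assumes full self-attention stores one score per token pair, and that a scan stores one state vector per token. The function names and the state size of 16 are hypothetical.

```python
def attention_score_elements(height, width):
    """Entries in a full self-attention score matrix over all pixels."""
    n_tokens = height * width
    return n_tokens * n_tokens  # one score per token pair: quadratic


def scan_state_elements(height, width, state_dim):
    """Entries needed to keep one state vector per scanned token."""
    n_tokens = height * width
    return n_tokens * state_dim  # grows linearly with the token count


STATE_DIM = 16  # hypothetical state size, chosen only for illustration

for side in (64, 128):
    attn = attention_score_elements(side, side)
    scan = scan_state_elements(side, side, STATE_DIM)
    print(f"{side}x{side}: attention={attn}, scan={scan}")
```

Under these assumptions, doubling the image side multiplies the attention count by 16 and the scan count by 4. This matches the quadratic and linear growth in the number of pixels that the caption describes.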