
Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution

Yi Xiao, Qiangqiang Yuan, Kui Jiang, Yuzeng Chen, Qiang Zhang, Chia-Wen Lin

TL;DR

This paper addresses remote sensing image super-resolution (RSI-SR) by introducing Frequency-assisted Mamba for RSI-SR (FMSR), a framework that fuses Vision State Space Modeling with frequency-domain cues to achieve long-range modeling with linear complexity. By integrating a Frequency Selection Module and a Hybrid Gate Module within Frequency-assisted Mamba Blocks, FMSR captures both global and local dependencies in a frequency-spatial dual domain, while learnable adaptors enable effective multi-level feature fusion. Empirical results on AID, DOTA, and DIOR show that FMSR outperforms the state-of-the-art Transformer-based method HAT-L by 0.11 dB PSNR on average while using only 28.05% of its memory and 19.08% of its computational complexity, and FMSR++ further benefits from self-ensembling. The work demonstrates a practical, efficient pathway for high-quality RSI-SR suitable for large-scale remote sensing applications, with code to be released publicly.

Abstract

Recent progress in remote sensing image (RSI) super-resolution (SR) has exhibited remarkable performance using deep neural networks, e.g., Convolutional Neural Networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs in large-scale RSI. To alleviate these issues, we make the first attempt to integrate the Vision State Space Model (Mamba) for RSI-SR, which specializes in processing large-scale RSI by capturing long-range dependency with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR, to explore the spatial and frequency correlations. In particular, our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM) to grasp their merits for effective spatial-frequency fusion. Considering that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features for accurate feature fusion via learnable scaling adaptors. Extensive experiments on AID, DOTA, and DIOR benchmarks demonstrate that our FMSR outperforms the state-of-the-art Transformer-based method HAT-L in terms of PSNR by 0.11 dB on average, while consuming only 28.05% and 19.08% of its memory and computational complexity, respectively. Code will be available at https://github.com/XY-boy/FreMamba
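
The linear-complexity claim can be illustrated with a minimal sketch of a discrete state space recurrence. This is not the paper's VSSM: it assumes a 1D sequence, a diagonal state matrix, and fixed parameters, and it omits the input-dependent (selective) parameters and 2D scanning that Mamba-style vision models use. The function name `ssm_scan` and the parameter names `a`, `b`, `c` are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state space recurrence over a 1D sequence.

    Assumed simplified form (not the paper's exact module):
        h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
        y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)          # hidden state, one entry per state dimension
    ys = []
    for x in xs:                # single pass: cost grows linearly with len(xs)
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# Hypothetical parameters: two state dimensions with different decay rates.
a = [0.5, 0.9]
b = [1.0, 1.0]
c = [1.0, 1.0]

# An impulse at position 0 still contributes to later outputs through the
# carried state, which is how distant positions stay connected.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a, b, c)
```

Each step touches only the current input and a fixed-size state, so the total cost is proportional to the sequence length times the state size, in contrast to the pairwise comparisons of self-attention that grow quadratically with sequence length.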

Paper Structure

This paper contains 24 sections, 15 equations, 13 figures, and 8 tables.

Figures (13)

  • Figure 1: The Effective Receptive Field (ERF) comparison for (a) the CNN-based method NLSN, (b) the Transformer-based model ATD, and (c) the proposed Mamba-based network FMSR. A wider distribution of dark areas indicates a larger ERF. Our FMSR obtains the largest ERF, indicating favorable global exploration capability.
  • Figure 2: Overview of the proposed FMSR. The Frequency-assisted Mamba Blocks (FMB) are arranged sequentially in Frequency-assisted Mamba Groups (FMG). In FMB, a Frequency Selection Module (FSM) is adopted to assist the learning process of the Vision State Space Module (VSSM) and Hybrid Gate Module (HGM). $\alpha_l$ is a learnable adaptor for hybrid adaptive integration in the $l$-th FMB.
  • Figure 3: Conceptual illustration of the proposed Hybrid Gate Module (HGM). The input feature $X$ is split along the channel dimension, and the two halves are fed through a Channel Attention Block (CAB) and a pixel-wise linear projection layer, respectively. After a Hadamard product of the two branches, a $1\times1$ convolution generates the output tensor $Y$.
  • Figure 4: Three variants of the Frequency Selection Module (FSM). Here, we adopt the 2D Fast Fourier Transform (FFT) for frequency learning.
  • Figure 5: Feature visualization comparisons. The feature map shown for each reference image is the 56th channel of the final FMG.
  • ...and 8 more figures
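
The data flow described in the Figure 3 caption (channel split, CAB branch, pixel-wise projection branch, Hadamard product, $1\times1$ convolution) can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the caption does not specify the internals of the CAB, so a squeeze-and-excitation style attention is assumed here, and all weight names (`w_down`, `w_up`, `w_proj`, `w_out`), the reduction ratio, and the tensor sizes are hypothetical.

```python
import numpy as np


def channel_attention(x, w_down, w_up):
    """Assumed CAB form: squeeze-and-excitation style channel reweighting.

    x has shape (C, H, W); the caption does not give the real CAB internals.
    """
    pooled = x.mean(axis=(1, 2))                        # global average pool, (C,)
    hidden = np.maximum(w_down @ pooled, 0.0)           # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w_up @ hidden)))    # sigmoid, (C,)
    return x * weights[:, None, None]


def pointwise(x, w):
    """Per-pixel linear map over channels (a bias-free 1x1 convolution).

    x has shape (C_in, H, W) and w has shape (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", w, x)


def hybrid_gate(x, params):
    """Split -> (CAB, linear projection) -> Hadamard product -> 1x1 conv."""
    x_a, x_b = np.split(x, 2, axis=0)                   # split along channels
    a = channel_attention(x_a, params["w_down"], params["w_up"])
    b = pointwise(x_b, params["w_proj"])
    return pointwise(a * b, params["w_out"])            # gate, then mix channels


# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
C, H, W = 8, 5, 5
half = C // 2
params = {
    "w_down": rng.standard_normal((half // 2, half)),
    "w_up": rng.standard_normal((half, half // 2)),
    "w_proj": rng.standard_normal((half, half)),
    "w_out": rng.standard_normal((C, half)),
}
x = rng.standard_normal((C, H, W))
y = hybrid_gate(x, params)                              # shape (C, H, W)
```

Because the two branches are multiplied elementwise, either branch can suppress the other at a given pixel and channel, which is the gating behavior the module name refers to.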