A Spatial-Spectral-Frequency Interactive Network for Multimodal Remote Sensing Classification

Hao Liu, Yunhao Gao, Wei Li, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone

TL;DR

S$^2$Fin addresses multimodal remote sensing classification under limited labels with a three-domain fusion framework that integrates spatial, spectral, and frequency information. The architecture combines a high-frequency sparse enhancement transformer (HFSET) built on sparse spatial-spectral attention, a two-level spatial-frequency fusion consisting of an adaptive frequency channel module (AFCM) and a high-frequency resonance mask (HFRM), and a spatial-spectral attention fusion (SSAF) module, all leveraging Mamba-based long-range fusion. Ablation studies and experiments on four datasets (HSI+LiDAR, HSI+SAR, MSI+SAR) show consistent improvements over state-of-the-art methods in overall accuracy (OA), average accuracy (AA), and Kappa while maintaining lower complexity. The results highlight the practical value of explicit frequency-domain learning for robust, data-efficient multimodal remote sensing classification and provide a framework for future integration with Mamba architectures and for segmentation and change-detection tasks.

Abstract

Deep learning-based methods have achieved significant success in remote sensing Earth observation data analysis. Numerous feature fusion techniques address multimodal remote sensing image classification by integrating global and local features. However, these techniques often struggle to extract structural and detail features from heterogeneous and redundant multimodal images. To bring frequency-domain learning to the modeling of key and sparse detail features, this paper introduces the spatial-spectral-frequency interaction network (S$^2$Fin), which integrates pairwise fusion modules across the spatial, spectral, and frequency domains. Specifically, we propose a high-frequency sparse enhancement transformer that employs sparse spatial-spectral attention to optimize the parameters of the high-frequency filter. Subsequently, a two-level spatial-frequency fusion strategy is introduced, comprising an adaptive frequency channel module that fuses low-frequency structures with enhanced high-frequency details, and a high-frequency resonance mask that emphasizes sharp edges via phase similarity. In addition, a spatial-spectral attention fusion module further enhances feature extraction at intermediate layers of the network. Experiments on four benchmark multimodal datasets with limited labeled data demonstrate that S$^2$Fin achieves superior classification performance, outperforming state-of-the-art methods. The code is available at https://github.com/HaoLiu-XDU/SSFin.
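
The abstract refers to low-frequency structures, high-frequency details, and phase similarity without giving formulas. As a rough illustration only, the NumPy sketch below splits a single-band patch with an ideal radial filter in the 2D DFT domain and computes the cosine of the phase difference between two patches at high frequencies. The function names, the `cutoff` value, and the cosine-based similarity are assumptions made for this example; they are not the paper's AFCM or HFRM definitions.

```python
import numpy as np

def radial_frequency_grid(height, width):
    """Radial frequency (cycles per pixel) of every 2D DFT bin; 0 at the DC bin."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fy ** 2 + fx ** 2)

def split_frequencies(patch, cutoff=0.15):
    """Split a 2D patch into low- and high-frequency parts (ideal radial filter).

    `cutoff` is a hypothetical threshold chosen for illustration.
    """
    spectrum = np.fft.fft2(patch)
    low_pass = radial_frequency_grid(*patch.shape) <= cutoff
    low = np.fft.ifft2(spectrum * low_pass).real
    high = np.fft.ifft2(spectrum * ~low_pass).real
    return low, high

def phase_similarity_mask(patch_a, patch_b, cutoff=0.15):
    """Cosine of the per-bin phase difference, kept only above the cutoff.

    Assumed stand-in for "phase similarity"; values lie in [-1, 1].
    """
    phase_a = np.angle(np.fft.fft2(patch_a))
    phase_b = np.angle(np.fft.fft2(patch_b))
    similarity = np.cos(phase_a - phase_b)
    high_pass = radial_frequency_grid(*patch_a.shape) > cutoff
    return np.where(high_pass, similarity, 0.0)

# Usage on a synthetic 8x8 patch (random data, not a real HSI or LiDAR sample).
rng = np.random.default_rng(0)
patch = rng.random((8, 8))
low, high = split_frequencies(patch)
mask = phase_similarity_mask(patch, patch)
```

Because the low-pass and high-pass masks are complementary, `low + high` reconstructs the input patch. A learned, adaptive filter such as the one the abstract describes would replace the fixed `cutoff` used here.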

Paper Structure

This paper contains 22 sections, 16 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Workflow comparisons. (a) The interaction and fusion of networks 1 and 2 usually focus on two of the spatial, spectral, and frequency domains. (b) The proposed S$^2$Fin aims to enhance the interaction between the three domains and different levels of the network.
  • Figure 2: Illustration of the proposed S$^2$Fin framework.
  • Figure 3: Spectral curves of the Houston dataset HSI after low- and high-frequency filtering via a 1D discrete Fourier transform. The horizontal axis is the band number and the vertical axis is the reflectance value. (a) All categories. (b) Categories 1 (healthy grass), 2 (stressed grass), and 3 (synthetic grass). (c) Categories 7 (residential) and 8 (commercial).
  • Figure 4: Structure of HFSET. The left part represents the high-frequency enhancement branch, while the right part is the sparse attention branch. The two branches are merged through a linear layer and a norm layer.
  • Figure 5: Example images of the Augsburg dataset HSI filtered by low- and high-frequency components obtained by applying a 2D DFT along the spatial dimensions. Three main bands are selected by principal component analysis (PCA), and ten samples per class are processed by the DFT to generate average component magnitude images. The seven class-averaged images are displayed from left to right.
  • ...and 10 more figures
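
The kind of filtering described in the Figure 3 caption (a spectral curve separated into low- and high-frequency components with a 1D DFT) can be sketched as follows. This is a minimal illustration under stated assumptions: `split_spectral_curve`, the `keep_bins` threshold, and the synthetic 64-band curve are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def split_spectral_curve(curve, keep_bins=5):
    """Split a 1D spectral curve into low- and high-frequency parts via the DFT.

    `keep_bins` (an assumed parameter) is the number of lowest real-FFT bins,
    including the DC term, assigned to the low-frequency part.
    """
    coeffs = np.fft.rfft(curve)
    low_coeffs = np.zeros_like(coeffs)
    low_coeffs[:keep_bins] = coeffs[:keep_bins]
    low = np.fft.irfft(low_coeffs, n=len(curve))
    high = np.fft.irfft(coeffs - low_coeffs, n=len(curve))
    return low, high

# Synthetic curve: a smooth trend plus a small fast ripple (not real Houston data).
bands = np.arange(64)
trend = 0.3 + 0.2 * np.sin(2 * np.pi * bands / 64)
ripple = 0.02 * np.sin(2 * np.pi * 20 * bands / 64)
low, high = split_spectral_curve(trend + ripple)
```

On this synthetic input the low-frequency part recovers the smooth trend and the high-frequency part recovers the ripple, since the two occupy disjoint DFT bins. Real spectra will not separate this cleanly, and the appropriate threshold depends on the sensor and the classes of interest.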