SFFNet: A Wavelet-Based Spatial and Frequency Domain Fusion Network for Remote Sensing Segmentation

Yunsong Yang, Genji Yuan, Jinjiang Li

TL;DR

SFFNet tackles remote sensing segmentation under large grayscale variations by fusing spatial features with frequency-domain information in a two-stage network. The Global and Local branches provide spatial modeling, while the Wavelet Transform Feature Decomposer (WTFD) adds low- and high-frequency cues; the Multiscale Dual-Representation Alignment Filter (MDAF) bridges the two by aligning their semantics and selecting features. Empirical results on Vaihingen and Potsdam show state-of-the-art performance, with mIoU reaching $84.80\%$ and $87.73\%$ respectively, along with improved convergence and robustness in shadowed and edge regions. By balancing spatial detail with frequency information, the approach offers a practical, efficient pathway to more reliable remote sensing (RS) segmentation in challenging scenes.

Abstract

To fully utilize spatial information for segmentation and to address the challenge of handling areas with significant grayscale variations in remote sensing segmentation, we propose the SFFNet (Spatial and Frequency Domain Fusion Network) framework. This framework employs a two-stage network design: the first stage extracts features using spatial methods to obtain features with sufficient spatial details and semantic information; the second stage maps these features in both the spatial and frequency domains. In the frequency domain mapping, we introduce the Wavelet Transform Feature Decomposer (WTFD) structure, which decomposes features into low-frequency and high-frequency components using the Haar wavelet transform and integrates them with spatial features. To bridge the semantic gap between frequency and spatial features, and to facilitate the selection of significant features so that features from different representation domains combine well, we design the Multiscale Dual-Representation Alignment Filter (MDAF). This structure utilizes multiscale convolutions and dual-cross attentions. Comprehensive experimental results demonstrate that, compared to existing methods, SFFNet achieves superior performance in terms of mIoU, reaching 84.80% and 87.73% on the Vaihingen and Potsdam datasets, respectively. The code is located at https://github.com/yysdck/SFFNet.
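For readers unfamiliar with the metric quoted above, mIoU (mean Intersection over Union) averages the per-class ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The following is a minimal pure-Python sketch of that standard definition, not the paper's evaluation code; the function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over flattened label lists (hypothetical helper, illustration only)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes   # pixels where prediction and label agree on class c
    union = [0] * num_classes   # pixels predicted as c or labeled as c
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Assumption: classes that appear in neither list are left out of the mean.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

Benchmark protocols often exclude some classes (for example a background or clutter class) from the average; whether and how SFFNet does so is not stated in this section.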

Paper Structure (27 sections, 25 equations, 18 figures, 6 tables)

Figures (18)

  • Figure 1: Challenges in remote sensing image segmentation: areas with large grayscale variations (such as shadows, edges, and regions with significant texture changes) are difficult to segment accurately, and relying solely on frequency domain features leads to spatial information loss. The first column shows the original images, the second the locally enlarged images, the third the locally enlarged ground truth (GT) labels, the fourth the segmentation results of some mainstream methods, and the fifth the segmentation results of SFFNet. Specifically, (a), (b), and (c) compare ST-Unet and SFFNet in shadow areas, edge areas, and regions with significant texture changes, respectively; (d) shows a scenario in which XNet splits a car into two halves because of spatial information loss, while SFFNet improves on this result. Panels (a) to (c) show that methods that do not use frequency domain features perform poorly in areas with large grayscale variations, while (d) illustrates the spatial information loss caused by methods based solely on the frequency domain.
  • Figure 2: The main framework of SFFNet, a two-stage segmentation network. The first stage performs spatial feature extraction to acquire sufficient spatial information. The second stage then applies three feature mappings: global feature mapping, local feature mapping, and frequency domain feature mapping. Global and local feature mapping preserve diverse spatial information, while frequency domain feature mapping introduces additional frequency domain information. The frequency domain feature mapping is performed by WTFD; MDAF then aligns the spatial and frequency domain features, bridging their semantic gap and facilitating their combination.
  • Figure 3: The three feature mapping methods in the feature mapping stage. (a) The global branch, which maps the original features into global features. This module replaces the shifted window in Swin Transformer with vertical bar-shaped convolutions, giving the improved module efficient global modeling capabilities and better adaptability to remote sensing tasks. (b) WTFD (Wavelet Transform Feature Decomposer), which provides frequency domain information to the network. This structure decomposes the original features into low-frequency and high-frequency features using the Haar wavelet transform, and then combines them with spatial features so that the model can consider features in a new representation domain. (c) The local branch, which uses pooling pyramids to map the original features into spatially local multi-scale features.
  • Figure 4: Establishing remote dependencies between windows using vertical stripwise convolutions. A set of vertical stripwise convolutions is applied to pre-segmented windows. For instance, as illustrated in the figure, feature connections from $x_1$ to both $x_2$ and $x_3$ are achieved through convolutions whose length equals the window size. Dependencies from $x_2$ and $x_3$ to $x_4$ are established in the same way. Because each pixel already carries information from the other pixels within its own window, these connections enable interaction between windows.
  • Figure 5: Principle diagram of feature decomposition using the Haar wavelet transform. A(x) denotes low-pass filtering of the original data to obtain the low-frequency approximation coefficients, and D(x) denotes high-pass filtering of the original data to obtain the high-frequency detail coefficients. Each decomposition halves the size of the features. The proposed WTFD obtains the low-frequency signal A, the horizontal high-frequency signal H, the vertical high-frequency signal V, and the diagonal high-frequency signal D through two iterations of the Haar wavelet transform.
  • ...and 13 more figures
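
The decomposition described in the Figure 5 caption can be sketched for a single feature channel as follows. This is a minimal pure-Python sketch of one level of an orthonormal 2D Haar transform, not the WTFD module from the paper; the function name `haar_dwt2`, the toy input, and the sign and naming convention used for H, V, and D are assumptions made for illustration.

```python
def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform (hypothetical helper).

    x is a list of rows with even height and width. Returns four sub-bands
    (A, H, V, D), each with half the height and half the width of x.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    A, H, V, D = [], [], [], []
    for i in range(0, h, 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for j in range(0, w, 2):
            # The 2x2 block [[a, b], [c, d]] at rows i..i+1, columns j..j+1.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row_a.append((a + b + c + d) / 2)  # low-pass in both directions
            # Assumed convention: H is top minus bottom (horizontal edges),
            # V is left minus right (vertical edges), D is the diagonal detail.
            row_h.append((a + b - c - d) / 2)
            row_v.append((a - b + c - d) / 2)
            row_d.append((a - b - c + d) / 2)
        A.append(row_a)
        H.append(row_h)
        V.append(row_v)
        D.append(row_d)
    return A, H, V, D


# Toy 4x4 "feature map": flat regions plus one vertical edge in the lower right.
feature = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [3.0, 3.0, 3.0, 7.0],
    [3.0, 3.0, 3.0, 7.0],
]
A, H, V, D = haar_dwt2(feature)
print("A =", A)
print("V =", V)
```

In this toy input only the lower-right block contains a left-to-right intensity change, so V is nonzero there while H and D stay zero everywhere, and each sub-band is 2x2 for a 4x4 input, matching the halving noted in the caption. How WTFD applies the transform to multi-channel feature maps and recombines the sub-bands with spatial features is not specified in this section.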