Mamba-FCS: Joint Spatio-Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing

Buddhi Wijenayake, Athulya Ratnayake, Praveen Sumanasekara, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath, Nichula Wasalathilaka

TL;DR

Mamba-FCS tackles semantic change detection (SCD) in remote sensing by combining a Visual State Space Model backbone with a joint spatio-frequency fusion module, a Change-Guided Attention (CGA) mechanism, and a SeK-inspired loss, so that the binary change detection (BCD) and SCD tasks are optimized together. The approach introduces FFT-based frequency cues, a frequency-aware fusion block, and a CGA module that propagates change information into the semantic decoders, enabling mutual reinforcement between BCD and SCD. Empirical results on the SECOND and Landsat-SCD datasets demonstrate state-of-the-art performance across OA, $F_{scd}$, mIoU, and SeK, with notable improvements on rare transitions and boundary delineation. The work also shows that the linear-complexity VMamba backbone sustains high performance at scalable computational cost, making it well suited to large-scale, high-resolution SCD deployments in remote sensing.

Abstract

Semantic Change Detection (SCD) from remote sensing imagery requires models that balance extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. Convolutional Neural Networks excel at local feature extraction but lack global context, whereas Transformers provide global modeling at high computational cost. Recent Mamba architectures based on state-space models offer a compelling alternative through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, an SCD framework built upon a Visual State Space Model backbone. It incorporates a Joint Spatio-Frequency Fusion block that uses log-amplitude frequency-domain features to enhance edge clarity and suppress illumination artifacts, a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined binary change detection (BCD) and SCD tasks, and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on the SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics: 88.62% Overall Accuracy, 65.78% $F_{scd}$, and 25.50% SeK on SECOND, and 96.25% Overall Accuracy, 89.27% $F_{scd}$, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm the distinct contribution of each novel component, and qualitative assessments highlight significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by the proposed techniques, and set a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be made publicly available upon publication.
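
For readers unfamiliar with the SeK metric reported above, the following is a minimal sketch of how it is commonly computed in SCD evaluation code for the SECOND benchmark: a Kappa coefficient evaluated on the confusion matrix with the unchanged/unchanged entry zeroed out, scaled by $e^{\mathrm{IoU}_{fg} - 1}$, where $\mathrm{IoU}_{fg}$ is the IoU of the merged "changed" class. The function name `sek_score`, the convention that index 0 denotes "no change", and the toy confusion matrix are assumptions for illustration. This sketches the evaluation metric only, not the SeK loss proposed in the paper.

```python
import math
import numpy as np

def sek_score(hist):
    """SeK from a confusion matrix (rows: ground truth, cols: prediction).

    Assumes index 0 is the 'no change' class (an illustrative convention).
    """
    hist = np.asarray(hist, dtype=np.float64)

    # Kappa on the matrix with the no-change/no-change entry removed,
    # so the dominant unchanged pixels cannot inflate the agreement.
    hist_n0 = hist.copy()
    hist_n0[0, 0] = 0.0
    total = hist_n0.sum()
    if total == 0:
        kappa = 0.0
    else:
        po = np.trace(hist_n0) / total  # observed agreement
        pe = float(hist_n0.sum(axis=1) @ hist_n0.sum(axis=0)) / total**2  # chance agreement
        kappa = 0.0 if pe == 1.0 else (po - pe) / (1.0 - pe)

    # IoU of the merged 'changed' class (all indices >= 1).
    changed_hit = hist[1:, 1:].sum()   # changed pixels predicted as changed
    missed = hist[1:, 0].sum()         # changed pixels predicted as unchanged
    false_alarm = hist[0, 1:].sum()    # unchanged pixels predicted as changed
    denom = changed_hit + missed + false_alarm
    iou_fg = changed_hit / denom if denom > 0 else 0.0

    return kappa * math.exp(iou_fg - 1.0)

# Toy 3-class example: class 0 = no change, classes 1 and 2 = change types.
toy_hist = np.array([[50, 5, 0],
                     [5, 10, 5],
                     [0, 5, 20]])
print(sek_score(toy_hist))
```

Because the unchanged/unchanged count is discarded before Kappa is computed, a model that simply predicts "no change" everywhere scores zero, which is why SeK values are much lower than Overall Accuracy on the same dataset.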

Paper Structure

This paper contains 38 sections, 30 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Overview of the proposed Mamba-FCS architecture. A Siamese encoder processes the pre-change image $I^{T_1}$ and post-change image $I^{T_2}$ through four stages (Stage I–IV) to extract multi-scale features. These features feed a central binary change decoder, which predicts the binary change map $Y_{\mathrm{BCD}}$, and two symmetric semantic decoders that output the semantic maps $Y^{T_1}$ and $Y^{T_2}$. Arrows of type (a) and (b) denote multi-scale skip connections from the pre-change and post-change encoder branches, respectively, to their corresponding decoder stages, (c) indicates within-branch propagation in the semantic decoders, and (d) marks change-guidance connections, where features from the binary decoder are injected into both semantic decoders.
  • Figure 2: Architecture of the Visual State-Space Model (VMamba) backbone, serving as the shared encoder. Following an initial patch-partition layer, the encoder comprises four stages $i=1,2,3,4$, each with $L_i$ Visual State-Space (VSS) blocks. These blocks utilize 2D selective scanning to progressively downsample the spatial resolution (from $H \times W$ to $H/32 \times W/32$) while expanding the channel depth. Feature dimensions extracted from each stage are annotated adjacent to the corresponding blocks.
  • Figure 3: (a) The Joint Spatio-Frequency Feature Fusion ($F_{\text{fusion}}$) block, embedded within each decoder stage $i$, concatenates spatial features $X^{T_1}_i$ and $X^{T_2}_i$, log-amplitude frequency-domain features $F^{T_1}_i$ and $F^{T_2}_i$, and the absolute difference map $D_i$ to form $X_i^{cat}$. (b) The CBAM Refinement Module compresses and refines $X_i^{cat}$ through a Convolutional Block Attention Module (CBAM), yielding the fused output tensor $X^{\text{fused}}_i$.
  • Figure 4: (a) Architecture of the Binary Change Decoder for generating the binary change map $Y_{\mathrm{BCD}}$. At each stage, encoder features from the two time points ($X_i^{T_1}$ and $X_i^{T_2}$) are fused through a fusion block to obtain $X^{\text{fused}}_i$. The fused features are then passed through a VSS block followed by a CBAM-based upsampling unit. Point-wise addition progressively integrates multi-scale information, while intermediate change maps $\{CM_i\}_{i=1}^4$ are extracted to support the Change-Guided Attention (CGA) module. (b) Architecture of the CBAM-based Upsampling Block, which reduces the channel count $C_i$ and increases the spatial dimensions $H_i$ and $W_i$ for the next stage.
  • Figure 5: (a) Architecture of the Semantic Map Decoder for the $j^{\text{th}}$ time stamp ($j \in \{1,2\}$). At each stage, the encoder feature $X_i^{T_j}$ is refined using the corresponding change map $CM_i$ through a Change-Guided Attention module, producing $\hat{X}_i^{T_j}$. Point-wise addition is employed at Stages I--III to progressively integrate multi-scale information. The refined features are then processed by a VSS block and upsampled via a CBAM-based upsampling unit. The final output is the semantic map $Y^{T_j}$ for timestamp $T_j$.
  • ...and 8 more figures
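
The concatenation step described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration, assuming the log-amplitude features are computed per channel as $\log(1 + |\mathrm{FFT2}(X)|)$ and that all five tensors are stacked along the channel axis. The function names, the `log1p` form, and the tensor shapes are assumptions, and the CBAM refinement that follows in the paper is omitted.

```python
import numpy as np

def log_amplitude(x):
    """Per-channel log-amplitude spectrum of a (C, H, W) feature map.

    The log(1 + |.|) form is an assumed choice that keeps the output
    finite and non-negative; the paper's exact formula may differ.
    """
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    return np.log1p(np.abs(spectrum))

def concat_spatio_frequency(x_t1, x_t2):
    """Build X_cat from two (C, H, W) feature maps of the same stage."""
    f_t1 = log_amplitude(x_t1)       # frequency features, time T1
    f_t2 = log_amplitude(x_t2)       # frequency features, time T2
    d = np.abs(x_t1 - x_t2)          # absolute difference map D_i
    # Stack along the channel axis: C spatial + C spatial + C freq + C freq + C diff.
    return np.concatenate([x_t1, x_t2, f_t1, f_t2, d], axis=0)

# Hypothetical stage features: 8 channels on a 16x16 grid.
rng = np.random.default_rng(0)
x_t1 = rng.standard_normal((8, 16, 16))
x_t2 = rng.standard_normal((8, 16, 16))
x_cat = concat_spatio_frequency(x_t1, x_t2)
print(x_cat.shape)  # (40, 16, 16)
```

Under these assumptions the concatenated tensor has five times the input channel count, which is why a subsequent compression step such as the CBAM Refinement Module in panel (b) is needed before the features continue through the decoder.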