MSSFC-Net: Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration

Dehua Huo, Weida Zhan, Jinxin Guo, Depeng Zhu, Yu Chen, YiChun Jiang, Yueyi Han, Deng Han, Jin Li

TL;DR

This work tackles the joint problem of building extraction and change detection in remote sensing by introducing MSSFC-Net, a transformer-based dual-task framework that jointly models spatial-spectral features and temporal differences. It proposes three key modules: the DMFE-SSFC for efficient multi-scale spatial-spectral feature learning without extra parameters, the MDFM for robust multi-scale fusion of dual-temporal features, and a segmentation head with task-specific queries to unify downstream outputs. Empirical results on WHU, LEVIR-CD, and BANDON demonstrate state-of-the-art or competitive performance for both tasks and reveal clear evidence of cross-task synergy enabled by shared representations and hierarchical feature interactions. The approach yields higher accuracy and completeness in building delineation and change localization, with ablation studies confirming the essential roles of SSFC, MDFM, and DMFE and demonstrating potential for lightweight variants in future work.

Abstract

Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection. However, most existing methods address these tasks independently, overlooking their inherent correlation and failing to exploit shared feature representations for mutual enhancement. Furthermore, the diverse spectral, spatial, and scale characteristics of buildings make it harder to jointly model spatial-spectral multi-scale features and to balance precision and recall. The limited synergy between spatial and spectral representations often results in reduced detection accuracy and incomplete change localization. To address these challenges, we propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images. The framework integrates both tasks within a unified architecture, leveraging their complementary nature to simultaneously extract building and change features. Specifically, a Dual-branch Multi-scale Feature Extraction module (DMFE) with Spatial-Spectral Feature Collaboration (SSFC) is designed to enhance multi-scale representation learning, capturing both shallow texture details and deep semantic information and thus improving building extraction performance. For temporal feature aggregation, we introduce a Multi-scale Differential Fusion Module (MDFM) that explicitly models the interaction between differential and dual-temporal features. This module refines the network's capability to detect both large-area changes and subtle structural variations in buildings. Extensive experiments conducted on three benchmark datasets demonstrate that MSSFC-Net achieves superior performance in both building extraction and change detection tasks, improving detection accuracy while maintaining completeness.
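The balance between precision and recall mentioned in the abstract can be made concrete with a small sketch. The section does not list the exact metrics the paper reports, so the standard pixel-wise definitions are assumed here; the function names and the toy masks are hypothetical and chosen only for illustration.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives and false negatives
    over two binary masks given as nested lists (1 = building)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def building_metrics(pred, truth):
    """Standard pixel-wise precision, recall, F1 and IoU (assumed metrics)."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy 3x3 masks: one spurious building pixel and one missed building pixel.
truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
pred = [[1, 1, 1],
        [1, 0, 0],
        [0, 0, 0]]

metrics = building_metrics(pred, truth)
```

In this toy case the spurious pixel lowers precision and the missed pixel lowers recall by the same amount, which is the trade-off between accuracy and completeness that the abstract refers to.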

Paper Structure

This paper contains 24 sections, 11 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Characteristics and challenges of building interpretation in remote sensing images, along with corresponding ground-truth segmentation mask example image pairs. Non-building areas represent the "background," and building areas represent the "foreground."
  • Figure 2: The overall architecture of MSSFC-Net, which processes dual-temporal imagery simultaneously to extract both individual buildings and building changes. The model mainly consists of four components: a multi-scale contextual feature extraction module with spatial-spectral feature coordination, a multi-scale difference fusion module, a decoder that queries the corresponding semantic mask features based on task cues, and a segmentation head that generates the final segmentation results.
  • Figure 3: The DMFE with SSFC splits the input features along the channel dimension and feeds them into two parallel context-aggregation branches to obtain multi-scale information and spatial-spectral feature information. The SSFC strategy generates 3-D attention weights on the feature maps through heuristic computation, without the need for any learnable parameters. Additionally, to improve the efficiency of 3-D attention, relational modeling is performed on the $\bar{Q}$, $\bar{K}$, and $\bar{V}$ tokens within a channel subset (C/4), enhancing the edges and internal details of dynamic targets in remote sensing images.
  • Figure 4: MSFF structure. The multi-scale features are extracted by combining multi-branch structures and convolution kernels of different scales. The feature information from different branches is aggregated through channel fusion, enabling the comprehensive extraction of features across multiple scales.
  • Figure 5: The MDFM structure. It is used to fuse the features obtained from dual-temporal images, generating differential features with contextual information. After generating the initial differential features $D^{i}$, a multi-scale feature learning mechanism is enhanced to fuse the dual-temporal features.
  • ...and 4 more figures
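
The Figure 3 caption describes 3-D attention weights computed heuristically, with no learnable parameters. The caption does not give the formula, so the sketch below uses a SimAM-style energy function as a stand-in: this is an assumption about the general technique, not a confirmed description of SSFC. The function names, the `eps` regularizer, and the toy feature map are hypothetical.

```python
import math


def attention_weights(channel, eps=1e-4):
    """Parameter-free weights for one H x W channel (SimAM-style, assumed).

    Positions that deviate strongly from the channel mean get a larger
    weight; nothing here is learned.
    """
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    n = max(len(values) - 1, 1)
    variance = sum(d for row in sq_dev for d in row) / n
    weights = []
    for row in sq_dev:
        # Energy term followed by a sigmoid, one weight per position.
        energies = [d / (4.0 * (variance + eps)) + 0.5 for d in row]
        weights.append([1.0 / (1.0 + math.exp(-e)) for e in energies])
    return weights


def apply_attention(feature_map, eps=1e-4):
    """Reweight a C x H x W feature map (nested lists) position by position.

    The weights vary over channel, height and width, hence "3-D".
    """
    refined = []
    for channel in feature_map:
        weights = attention_weights(channel, eps)
        refined.append([
            [v * w for v, w in zip(row, w_row)]
            for row, w_row in zip(channel, weights)
        ])
    return refined


# Toy feature map: channel 0 has one distinctive position, channel 1 is flat.
features = [
    [[0.0, 0.0], [0.0, 4.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
weights = attention_weights(features[0])
refined = apply_attention(features)
```

In the toy input, the distinctive position in channel 0 receives a larger weight than its neighbours, while the flat channel is scaled uniformly. This is the kind of behaviour that would emphasize edges and internal detail without adding parameters, under the stated assumption about the weighting rule.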