Joint Spatio-Temporal Modeling for the Semantic Change Detection in Remote Sensing Images
Lei Ding, Jing Zhang, Kai Zhang, Haitao Guo, Bing Liu, Lorenzo Bruzzone
TL;DR
The paper tackles semantic change detection (SCD) in bi-temporal remote sensing imagery, where the main difficulties are learning semantic changes from a limited number of change samples and keeping the two temporal predictions consistent with each other. It introduces SCanNet, a hybrid CNN–Transformer framework that first extracts temporal semantic features and change features with a Triple Encoder-Decoder (TED), then models deep spatio-temporal semantic–change dependencies with a Semantic Change Transformer head (SCanFormer) based on cross-shaped window attention. A semantic learning scheme with temporal-consistency constraints combines semantic supervision on changed areas, pseudo-labels for unchanged areas, and a bi-temporal consistency loss that aligns the two predictions. SCanNet achieves state-of-the-art results on the SECOND and Landsat-SCD datasets, and ablation studies confirm the benefits of the TED architecture, the semantic learning losses, and the SCanFormer module. Overall, the approach advances SCD by explicitly modeling semantic–change correlations over space and time, improving both detection accuracy and the semantic consistency of the bi-temporal results.
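To make the idea of a bi-temporal consistency loss more concrete, here is a minimal NumPy sketch. It is an illustrative assumption, not the paper's formulation: the function name `temporal_consistency_loss`, the use of cosine similarity between per-pixel class distributions, and the `(C, H, W)` layout are all hypothetical choices. The sketch encourages the two temporal predictions to agree on unchanged pixels and to differ on changed pixels.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_consistency_loss(logits_t1, logits_t2, change_mask, eps=1e-8):
    """Hypothetical consistency term (an assumption, not the paper's loss).

    logits_t1, logits_t2: float arrays of shape (C, H, W), one per date.
    change_mask: bool array of shape (H, W), True where the pixel changed.
    """
    p1 = softmax(logits_t1, axis=0)
    p2 = softmax(logits_t2, axis=0)
    # Per-pixel cosine similarity between the two class distributions.
    dot = (p1 * p2).sum(axis=0)
    norms = np.linalg.norm(p1, axis=0) * np.linalg.norm(p2, axis=0)
    cos = dot / (norms + eps)
    # Unchanged pixels are penalized for dissimilarity (1 - cos),
    # changed pixels are penalized for similarity (cos).
    per_pixel = np.where(change_mask, cos, 1.0 - cos)
    return float(per_pixel.mean())
```

Under these assumptions, identical predictions over an entirely unchanged scene give a loss close to 0, while identical predictions over an entirely changed scene give a loss close to 1.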
Abstract
Semantic Change Detection (SCD) refers to the task of simultaneously extracting the changed areas and their semantic categories (before and after the changes) in Remote Sensing Images (RSIs). This is more meaningful than Binary Change Detection (BCD) since it enables detailed change analysis in the observed areas. Previous works established triple-branch Convolutional Neural Network (CNN) architectures as the paradigm for SCD. However, it remains challenging to exploit semantic information with a limited number of change samples. In this work, we investigate jointly modeling the spatio-temporal dependencies to improve the accuracy of SCD. First, we propose a Semantic Change Transformer (SCanFormer) to explicitly model the 'from-to' semantic transitions between the bi-temporal RSIs. Then, we introduce a semantic learning scheme that leverages spatio-temporal constraints, which are coherent with the SCD task, to guide the learning of semantic changes. The resulting network (SCanNet) significantly outperforms the baseline method in terms of both the detection of critical semantic changes and the semantic consistency of the obtained bi-temporal results. It achieves state-of-the-art (SOTA) accuracy on two SCD benchmark datasets.
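The SCD output described above (changed areas plus 'from-to' categories) can be illustrated with a small NumPy sketch. It assumes a common convention that is not stated in the abstract: label 0 is reserved for "no change" and land-cover classes are numbered from 1. The function name `compose_scd` and its arguments are hypothetical.

```python
import numpy as np


def compose_scd(sem_t1, sem_t2, change_mask, num_classes):
    """Combine two semantic maps and a binary change map into SCD results.

    sem_t1, sem_t2: int arrays of shape (H, W), classes in [1, num_classes].
    change_mask: bool array of shape (H, W), True where the pixel changed.
    Label 0 is assumed to mean "no change" (an illustrative convention).
    """
    # Keep semantic labels only inside the changed areas.
    out_t1 = np.where(change_mask, sem_t1, 0)
    out_t2 = np.where(change_mask, sem_t2, 0)
    # Count 'from-to' transitions: rows are the class at the first date,
    # columns are the class at the second date.
    transitions = np.zeros((num_classes + 1, num_classes + 1), dtype=np.int64)
    np.add.at(transitions, (sem_t1[change_mask], sem_t2[change_mask]), 1)
    return out_t1, out_t2, transitions


sem_t1 = np.array([[1, 2], [3, 1]])
sem_t2 = np.array([[2, 2], [1, 1]])
change_mask = np.array([[True, False], [True, False]])
out_t1, out_t2, transitions = compose_scd(sem_t1, sem_t2, change_mask, num_classes=3)
```

In this toy example two pixels change, one from class 1 to class 2 and one from class 3 to class 1, so `transitions[1, 2]` and `transitions[3, 1]` each hold a count of 1 and every unchanged pixel is labeled 0 in both output maps.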
