STeInFormer: Spatial-Temporal Interaction Transformer Architecture for Remote Sensing Change Detection

Xiaowen Ma; Zhenkai Wu; Mengting Ma; Mengjiao Zhao; Fan Yang; Zhenhong Du; Wei Zhang

TL;DR

STeInFormer introduces a dedicated RSCD backbone built on spatial-temporal interaction transformers, featuring cross-spatial interactors (CSIs) and cross-temporal interactors (CTIs) to actively fuse multi-scale bi-temporal features. It adds a parameter-free multi-frequency mixer leveraging 2D-DCT frequencies to enrich token mixing with linear complexity, while a lightweight decoder and a focal-d Dice-based loss optimize segmentation of changed areas. Extensive experiments on WHU-CD, LEVIR-CD, and CLCD show state-of-the-art F1 scores with a favorable efficiency-accuracy trade-off, and ablations confirm the necessity of both spatio-temporal interactions and frequency-domain mixing. The work suggests STeInFormer as a general RSCD backbone and points to future work in aligning a change-detection head with the proposed encoder for further gains.
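As a rough illustration of how a parameter-free, frequency-based token mixer can run in linear time, the sketch below weights channel groups with fixed 2D-DCT basis functions and uses the pooled responses to gate the feature map. The function names (`dct_basis`, `multi_frequency_mix`), the choice of frequencies, the channel grouping, and the sigmoid gating are all assumptions made for illustration; they are not taken from the paper or its repository, and the actual STeInFormer mixer may differ.

```python
import numpy as np

def dct_basis(h, w, u, v):
    """2D DCT-II basis function of frequency (u, v) on an h x w grid."""
    ys = np.cos(np.pi * (np.arange(h) + 0.5) * u / h)
    xs = np.cos(np.pi * (np.arange(w) + 0.5) * v / w)
    return np.outer(ys, xs)  # shape (h, w)

def multi_frequency_mix(x, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
    """Hypothetical mixer for a feature map x of shape (C, H, W).

    Assumed scheme: split channels into len(freqs) groups, project each
    group onto one fixed DCT basis, and gate the channels with the result.
    There are no learnable parameters, and the cost is O(C * H * W),
    i.e. linear in the number of tokens H * W.
    """
    c, h, w = x.shape
    groups = np.array_split(np.arange(c), len(freqs))
    desc = np.empty(c)
    for idx, (u, v) in zip(groups, freqs):
        basis = dct_basis(h, w, u, v)
        # One DCT coefficient per channel: a weighted average over all tokens.
        desc[idx] = (x[idx] * basis).sum(axis=(1, 2)) / (h * w)
    gate = 1.0 / (1.0 + np.exp(-desc))  # sigmoid, an illustrative choice
    return x * gate[:, None, None]

features = np.ones((8, 4, 4))
mixed = multi_frequency_mix(features)
```

With frequency (0, 0) the basis is constant, so that channel group reduces to global average pooling; the remaining groups respond to horizontal, vertical, and diagonal variation instead.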

Abstract

Convolutional neural networks and attention mechanisms have greatly benefited remote sensing change detection (RSCD) because of their outstanding discriminative ability. Existing RSCD methods often follow a paradigm that uses a non-interactive Siamese neural network for multi-temporal feature extraction and change detection heads for feature fusion and change representation. However, this paradigm does not account for the temporal and spatial characteristics of RSCD, and the resulting lack of spatial-temporal interaction hinders high-quality feature extraction. To address this problem, we present STeInFormer, a spatial-temporal interaction Transformer architecture for multi-temporal feature extraction, which is the first general backbone network specifically designed for RSCD. In addition, we propose a parameter-free multi-frequency token mixer to integrate frequency-domain features that provide spectral information for RSCD. Experimental results on three datasets validate the effectiveness of the proposed method, which outperforms state-of-the-art methods and achieves the most satisfactory efficiency-accuracy trade-off. Code is available at https://github.com/xwmaxwma/rschange.
