STeInFormer: Spatial-Temporal Interaction Transformer Architecture for Remote Sensing Change Detection

Xiaowen Ma, Zhenkai Wu, Mengting Ma, Mengjiao Zhao, Fan Yang, Zhenhong Du, Wei Zhang

TL;DR

STeInFormer introduces a dedicated RSCD backbone built on spatial-temporal interaction transformers, featuring cross-spatial interactors (CSIs) and cross-temporal interactors (CTIs) that actively fuse multi-scale bi-temporal features. It adds a parameter-free multi-frequency mixer that uses 2D-DCT frequencies to enrich token mixing with linear complexity, while a lightweight decoder and a focal- and Dice-based loss optimize segmentation of changed areas. Experiments on WHU-CD, LEVIR-CD, and CLCD show state-of-the-art F1 scores with a favorable efficiency-accuracy trade-off, and ablations confirm that both the spatial-temporal interactions and the frequency-domain mixing are necessary. The work proposes STeInFormer as a general RSCD backbone and points to future work on designing a change-detection head matched to the proposed encoder for further gains.

Abstract

Convolutional neural networks and attention mechanisms have greatly benefited remote sensing change detection (RSCD) because of their outstanding discriminative ability. Existing RSCD methods often follow a paradigm that uses a non-interactive Siamese neural network for multi-temporal feature extraction and change detection heads for feature fusion and change representation. However, this paradigm does not account for the temporal and spatial characteristics of RSCD, and the resulting lack of spatial-temporal interaction hinders high-quality feature extraction. To address this problem, we present STeInFormer, a spatial-temporal interaction Transformer architecture for multi-temporal feature extraction, which is the first general backbone network specifically designed for RSCD. In addition, we propose a parameter-free multi-frequency token mixer to integrate frequency-domain features that provide spectral information for RSCD. Experimental results on three datasets validate the effectiveness of the proposed method, which outperforms state-of-the-art methods and achieves the most favorable efficiency-accuracy trade-off. Code is available at https://github.com/xwmaxwma/rschange.

Paper Structure

This paper contains 28 sections, 13 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visualization of two challenges in RSCD: frequent non-interest changes and the requirement for high spatial detail. Examples of changes of interest (red box) and non-interest changes (blue box: non-interest objects; orange box: illumination variations; green box: registration errors) are shown in the lower bi-temporal images. The upper-right chart illustrates the imbalance between the numbers of changed and unchanged pixels on three datasets.
  • Figure 2: Architecture of the STeInFormer. Given bi-temporal images as input, multi-scale features are extracted by each CSI, which is U-shaped and built from base blocks, to fuse the semantic information of high-level features with the spatial detail of low-level features. The deepest features of each CSI are fed to the corresponding CTI for cross-temporal interaction. H-i denotes that the CSI downsamples i times, so that every feature map input to a CTI module has 1/32 of the resolution of the original images. The STeInFormer outputs bi-temporal features at four scales.
  • Figure 3: Structure of the multi-frequency mixer. The input feature map $R_p$ is split into $A_i$ along the channel dimension after projection mapping. Each $A_i$ is then transformed by the pre-selected DCT basis function $B_{u_i,v_i}$ of size $p \times p$ to obtain the frequency values $A'_i$. The output feature map $R_f$ is obtained by projection mapping after concatenating all $\{A'_i\}$ along the channel dimension.
  • Figure 4: Example outputs from our STeInFormer and other methods for comparison on WHU-CD (first and second rows), LEVIR-CD (third and fourth rows), and CLCD (fifth and sixth rows). Pixels are colored for visualization (white: true positive; black: true negative; red: false positive; green: false negative).
  • Figure 5: Comparison of performance statistics on three datasets. Bars represent F1-scores, with standard deviations shown at the top. Each bar is centered on its method name.
  • ...and 1 more figures
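
The multi-frequency mixer described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It makes three assumptions that the caption does not state: the two projection mappings are omitted (treated as identity), each channel group $A_i$ is reduced to one frequency value per channel by a weighted sum with its basis function, and the basis is the unnormalized 2D DCT-II. The names `dct_basis` and `multi_frequency_mix` are hypothetical.

```python
import numpy as np


def dct_basis(u, v, p):
    """Unnormalized 2D DCT-II basis function B_{u,v} of size p x p."""
    idx = np.arange(p)
    rows = np.cos(np.pi * u * (idx + 0.5) / p)  # variation along height
    cols = np.cos(np.pi * v * (idx + 0.5) / p)  # variation along width
    return np.outer(rows, cols)


def multi_frequency_mix(r_p, freqs):
    """Mix a (C, p, p) feature map with one pre-selected frequency per group.

    r_p   : feature map of shape (C, p, p)
    freqs : list of (u_i, v_i) pairs; C must be divisible by len(freqs)
    Returns a vector of shape (C,), the concatenated frequency values.
    Assumption: projections before and after the mixer are left out.
    """
    c, p, p2 = r_p.shape
    if p != p2 or c % len(freqs) != 0:
        raise ValueError("expected square maps and C divisible by len(freqs)")
    groups = np.split(r_p, len(freqs), axis=0)  # A_i along the channel axis
    mixed = []
    for a_i, (u, v) in zip(groups, freqs):
        basis = dct_basis(u, v, p)
        # Weighted sum over the p x p positions: no learnable parameters,
        # and every token is visited once per channel.
        mixed.append((a_i * basis).sum(axis=(1, 2)))
    return np.concatenate(mixed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(4, 8, 8))
    print(multi_frequency_mix(features, [(0, 0), (0, 1)]).shape)
```

With the frequency pair (0, 0) the basis is constant, so that group reduces to a plain spatial sum (global pooling up to a scale factor); higher frequencies respond to spatial variation that such pooling discards. Because the basis functions are fixed in advance, the mixing step itself adds no trainable weights, which matches the "parameter-free" description in the abstract.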