CDXLSTM: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory

Zhenkai Wu, Xiaowen Ma, Rongrong Lian, Kai Zheng, Wei Zhang

TL;DR

The paper tackles remote sensing change detection (RS-CD), where existing CNN-, Transformer-, and Mamba-based methods struggle to balance accuracy and efficiency. It introduces CDXLSTM, an XLSTM-based framework with a scale-specific Feature Enhancer and a Cross-Scale Interactive Fusion (CSIF) module. The Feature Enhancer applies a Cross-Temporal Global Perceptron (CTGP) to capture global context in low-resolution features and a Cross-Temporal Spatial Refiner (CTSR) to refine spatial detail in high-resolution features, while CSIF progressively combines global semantics with detailed spatial information. The architecture uses a Siamese backbone with Bi-mLSTM-based long-term modeling and axial Bi-mLSTM attention within CTSR, delivering linear computational complexity and improved interpretability. On LEVIR-CD, WHU-CD, and CLCD, CDXLSTM achieves state-of-the-art F1 scores with only 16.19M parameters and 3.92G FLOPs, outperforming recent methods while reducing compute. Training is supervised with a weighted sum of binary cross-entropy (BCE) and Dice losses, $\mathcal{L} = \lambda_{ce}\mathcal{L}_{ce} + \lambda_{dice}\mathcal{L}_{dice}$.
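The weighted BCE and Dice objective mentioned in the TL;DR can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the smoothing constant, and the default weights `lambda_ce = lambda_dice = 1.0` are assumptions chosen for illustration, since the summary does not state the values used in the paper.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over a flat list of pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(probs) + sum(targets) + smooth)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)

def combined_loss(probs, targets, lambda_ce=1.0, lambda_dice=1.0):
    """L = lambda_ce * L_ce + lambda_dice * L_dice (weights are assumed values)."""
    return lambda_ce * bce_loss(probs, targets) + lambda_dice * dice_loss(probs, targets)

# Hypothetical 4-pixel change map: 1 = changed, 0 = unchanged.
targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.2, 0.8, 0.3, 0.7]
print(combined_loss(good, targets) < combined_loss(bad, targets))
```

The BCE term penalizes each pixel independently, while the Dice term depends on the overlap between the predicted and true change regions, which is why the two are commonly combined when changed pixels are a small fraction of the image.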

Abstract

In complex scenes and varied conditions, effectively integrating spatial-temporal context is crucial for accurately identifying changes. However, current RS-CD methods lack a balanced consideration of performance and efficiency. CNNs lack global context, Transformers are computationally expensive, and Mambas face CUDA dependence and local correlation loss. In this paper, we propose CDXLSTM, with a core component that is a powerful XLSTM-based feature enhancement layer, integrating the advantages of linear computational complexity, global context perception, and strong interpretability. Specifically, we introduce a scale-specific Feature Enhancer layer, incorporating a Cross-Temporal Global Perceptron customized for semantic-accurate deep features, and a Cross-Temporal Spatial Refiner customized for detail-rich shallow features. Additionally, we propose a Cross-Scale Interactive Fusion module to progressively interact global change representations with spatial responses. Extensive experimental results demonstrate that CDXLSTM achieves state-of-the-art performance across three benchmark datasets, offering a compelling balance between efficiency and accuracy. Code is available at https://github.com/xwmaxwma/rschange.

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The CDXLSTM architecture feeds a pair of co-registered remote sensing images into a Siamese backbone with shared weights, producing two feature maps at each stage. CTSR is applied in the first two (shallow) stages to refine spatial changes, while CTGP is applied in the last two (deep) stages to capture global changes. Here, Bi-mLSTM refers to an mLSTM module that performs bidirectional scanning. The enhanced feature maps are then progressively aggregated through the CSIF module.
  • Figure 2: Example results on LEVIR-CD (row 1), WHU-CD (row 2), and CLCD (row 3) test sets, with pixel color coding: white for true positives, black for true negatives, red for false positives, and green for false negatives.
  • Figure 3: Class activation maps generated by Grad-CAM for the change category of features modulated by the first stage (1/4 resolution of the input image) of the Feature Enhancer (FE). Example images are from the CLCD test set. The configurations represented by "$\circledcirc$" and "$\bot$" correspond to Table 2.
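
The pixel color coding described for Figure 2 can be reproduced with a short sketch. It is an illustration only, not the authors' plotting code: the function names and the exact RGB values are assumptions, and the masks are hypothetical binary lists (1 = changed, 0 = unchanged). The same true-positive, false-positive, and false-negative counts also give the F1 score used as the headline metric.

```python
# Assumed RGB values for the four colors named in the Figure 2 caption.
WHITE, BLACK, RED, GREEN = (255, 255, 255), (0, 0, 0), (255, 0, 0), (0, 255, 0)

# Keys are (prediction, ground truth) pairs.
COLORS = {
    (1, 1): WHITE,  # true positive
    (0, 0): BLACK,  # true negative
    (1, 0): RED,    # false positive: predicted change where none occurred
    (0, 1): GREEN,  # false negative: missed change
}

def error_map(pred, gt):
    """Return a 2D grid of RGB tuples coloring each pixel by its outcome."""
    return [[COLORS[(p, g)] for p, g in zip(prow, grow)]
            for prow, grow in zip(pred, gt)]

def f1_score(pred, gt):
    """F1 = 2*TP / (2*TP + FP + FN) for the change class."""
    tp = fp = fn = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            tp += p == 1 and g == 1
            fp += p == 1 and g == 0
            fn += p == 0 and g == 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical 2x2 masks containing one pixel of each outcome.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
print(error_map(pred, gt))
print(f1_score(pred, gt))
```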