Towards Remote Sensing Change Detection with Neural Memory

Zhenyu Yang, Gensheng Pei, Yazhou Yao, Tianfei Zhou, Lizhong Ding, Fumin Shen

TL;DR

This work tackles remote sensing change detection (RSCD) by combining a Titans-inspired neural memory backbone with segmented local attention, capturing long-range context while preserving local detail at high resolution. The proposed ChangeTitans framework comprises a memory-augmented VTitans backbone, a multi-scale VTitans-Adapter, and a two-stream TS-CBAM fusion module, followed by a convex upsampling decoder trained with a BCE+Dice loss to produce accurate and coherent change maps. Across four public benchmarks (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD), ChangeTitans achieves state-of-the-art performance at competitive computational cost, for example 84.36% IoU and 91.52% F1 on LEVIR-CD, with strong results also reported on SAR-CD. The results suggest that integrating neural memory with segmented attention is a practical route to robust, scalable RSCD and to efficient, high-precision change mapping under diverse sensing conditions.
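The BCE+Dice objective mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `bce_dice_loss`, the equal weighting of the two terms (`dice_weight=1.0`), and the smoothing constant `eps` are assumptions, since the summary does not state them.

```python
import math


def bce_dice_loss(probs, targets, dice_weight=1.0, eps=1e-6):
    """Mean binary cross-entropy plus soft Dice loss over a flat list of pixels.

    probs:   predicted change probabilities in [0, 1]
    targets: ground-truth labels (1 = changed, 0 = unchanged)
    dice_weight and eps are illustrative assumptions, not values from the paper.
    """
    if not probs or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equally long")

    # Binary cross-entropy, with probabilities clamped to avoid log(0).
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)

    # Soft Dice loss: 1 - 2|P ∩ G| / (|P| + |G|), smoothed by eps.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)

    return bce + dice_weight * dice


# Usage: a confident correct prediction scores far lower than a confident wrong one.
good = bce_dice_loss([1.0, 0.0, 1.0], [1, 0, 1])
bad = bce_dice_loss([0.0, 1.0, 0.0], [1, 0, 1])
```

The cross-entropy term penalizes each pixel independently, while the Dice term measures region overlap, which is commonly used to counter the strong imbalance between changed and unchanged pixels.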

Abstract

Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, a Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining 84.36% IoU and 91.52% F1-score on LEVIR-CD, while remaining computationally competitive.
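The two reported metrics are linked when both are computed on the change class from the same confusion counts: IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN), so F1 = 2·IoU / (1 + IoU). The sketch below assumes that convention (the summary does not spell it out) and uses hypothetical helper names; it shows that the reported 84.36% IoU is consistent with the reported 91.52% F1.

```python
def iou_and_f1(tp, fp, fn):
    """IoU and F1 of the change class from pixel-level confusion counts."""
    if tp + fp + fn == 0:
        raise ValueError("metrics are undefined when there are no positives")
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


def f1_from_iou(iou):
    """F1 implied by an IoU value when both use the same TP/FP/FN counts."""
    return 2 * iou / (1 + iou)


# Usage: toy counts, then the LEVIR-CD figure quoted in the abstract.
iou, f1 = iou_and_f1(tp=6, fp=2, fn=2)   # iou = 0.6, f1 = 0.75
implied_f1 = f1_from_iou(0.8436)          # about 0.9152
```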

Paper Structure (18 sections, 32 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Comparison of IoU on the LEVIR-CD dataset against computational efficiency (FLOPs) and model size (Parameters). The proposed ChangeTitans (●) achieves 84.36% IoU with only 30.39G FLOPs, surpassing existing methods in both detection accuracy and computational efficiency.
  • Figure 2: The overall architecture of the proposed ChangeTitans comprises four main components: (a) a Titans-based visual backbone (VTitans), (b) a lightweight VTitans-Adapter for constructing hierarchical feature representations, (c) a bi-temporal fusion module (TS-CBAM), and (d) a decoder for generating the final binary change maps. The internal structures of VTitans and TS-CBAM are illustrated in the upper and lower sub-blocks, respectively. Details of the VTitans-Adapter are provided in Fig. 3, while the Channel Attention and Spatial Attention modules within TS-CBAM are depicted on the right.
  • Figure 3: The structure of our VTitans-Adapter ($L = 12$) is illustrated as follows. It begins with (b1) the Spatial Prior Module to extract initial hierarchical features. These features are then progressively refined using four stages (i.e., Layers 3, 6, 9, and 12) of (b2) Injectors and (b3) Extractors, guided by the non-hierarchical representations from (a) VTitans. The final output serves as the encoder’s multi-scale representation. Feature maps are visualized by PCA.
  • Figure 4: The overall architecture of the VTitans encoder. The input image is split into patches and passed through a series of Titans blocks, each equipped with segmented self-attention and neural memory. The memory modules enable long-range context modeling, while the segmented attention ensures computational efficiency. A linear prediction head follows the final block for downstream feature decoding.
  • Figure 5: Qualitative results on the LEVIR-CD dataset (Chen and Shi, 2020). Examples are grouped into four challenging scenarios: (a) & (b) irregularly shaped objects, (c) & (d) densely packed regions, (e) & (f) small-scale changes, and (g) & (h) changes near image boundaries. Predicted outputs are color-coded as follows: white for true positives (TP), black for true negatives (TN), green for false positives (FP), and red for false negatives (FN). In the figure, B represents Boundary F1 score, T represents Trimap-based mIoU, and H represents Hausdorff distance.
  • ...and 4 more figures
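The color coding described for Figure 5 (white TP, black TN, green FP, red FN) can be reproduced from a binary prediction and its ground truth. The following is a small sketch under assumptions: the function name `error_map` and the exact RGB triples are illustrative, and the inputs are plain nested lists of 0/1 labels rather than image arrays.

```python
# (prediction, ground truth) -> RGB color; exact RGB values are an assumption.
COLORS = {
    (1, 1): (255, 255, 255),  # true positive: white
    (0, 0): (0, 0, 0),        # true negative: black
    (1, 0): (0, 255, 0),      # false positive: green
    (0, 1): (255, 0, 0),      # false negative: red
}


def error_map(pred, gt):
    """Return a per-pixel RGB map comparing a binary prediction to ground truth."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("pred and gt must have the same shape")
    return [
        [COLORS[(p, g)] for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred, gt)
    ]


# Usage: a 2x2 example containing one pixel of each outcome.
pred = [[1, 0],
        [1, 0]]
gt = [[1, 0],
      [0, 1]]
colored = error_map(pred, gt)
```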