Treat Stillness with Movement: Remote Sensing Change Detection via Coarse-grained Temporal Foregrounds Mining

Xixi Wang, Zitian Wang, Jingtao Jiang, Lan Chen, Xiao Wang, Bo Jiang

TL;DR

The paper tackles remote sensing change detection (RSCD) by arguing that motion cues between bi-temporal images are underutilized in conventional pipelines. It introduces the Coarse-grained Temporal Mining Augmented (CTMA) framework, which first converts an image pair into a dense pseudo-video and learns temporal features with a Temporal Encoder to yield a coarse change map. It then applies a Coarse-grained Foregrounds Augmented Spatial Encoder (CFA-SE) that fuses global and local information and refines the prediction with motion-augmented and mask-augmented strategies. A weighted BCE loss supervises both the temporal and spatial branches. The authors report state-of-the-art performance on SVCD, LEVIR-CD, and WHU-CD, with ablations supporting each component. The work advances RSCD by integrating motion cues and coarse-to-fine fusion, and the authors state that the source code will be released for reproducibility and further research.
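As a rough illustration of the supervision described above, the following minimal sketch computes a weighted BCE loss over both branches. The summary does not say how the weighting is defined, so `pos_weight` (extra weight on changed pixels) and `lam` (balance between the coarse and fine branches) are hypothetical parameters, not the authors' formulation.

```python
import math


def weighted_bce(probs, targets, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with extra weight on the positive (changed) class.

    probs and targets are flat sequences of per-pixel values.
    pos_weight is an assumed weighting scheme, not taken from the paper.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def two_branch_loss(coarse_probs, fine_probs, targets, lam=1.0, pos_weight=1.0):
    """Supervise the fine (spatial) and coarse (temporal) predictions with the same labels.

    lam is a hypothetical balancing factor between the two branches.
    """
    fine = weighted_bce(fine_probs, targets, pos_weight)
    coarse = weighted_bce(coarse_probs, targets, pos_weight)
    return fine + lam * coarse
```

Both branches are compared against the same ground-truth change mask, so the coarse map produced by the temporal branch is trained directly rather than only through the final prediction.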

Abstract

Current works address the remote sensing change detection task using bi-temporal images. Although they achieve good performance, they seldom consider motion cues, which may also be vital. In this work, we revisit the widely adopted bi-temporal image-based framework and propose a novel Coarse-grained Temporal Mining Augmented (CTMA) framework. Specifically, given the bi-temporal images, we first transform them into a video using interpolation operations. Then, a set of temporal encoders is adopted to extract motion features from the obtained video for coarse-grained changed region prediction. Subsequently, we design a novel Coarse-grained Foregrounds Augmented Spatial Encoder module to integrate both global and local information. We also introduce a motion augmented strategy that leverages motion cues as an additional output and aggregates them with the spatial features for improved results. Meanwhile, we feed the input image pairs into a ResNet to obtain the different features, and then into the spatial blocks for fine-grained feature learning. More importantly, we propose a mask augmented strategy that incorporates the coarse-grained changed regions into the decoder blocks to enhance the final change prediction. Extensive experiments conducted on multiple benchmark datasets validate the effectiveness of our proposed framework for remote sensing image change detection. The source code of this paper will be released at https://github.com/Event-AHU/CTM_Remote_Sensing_Change_Detection
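The first step of the abstract, turning a bi-temporal pair into a video, can be sketched as follows. The abstract only says "interpolation operations", so simple linear blending is assumed here; the function name `make_pseudo_video` and the `num_frames` parameter are hypothetical.

```python
import numpy as np


def make_pseudo_video(img1, img2, num_frames=8):
    """Blend two co-registered images into a frame stack of shape (T, *img.shape).

    Linear interpolation is an assumption; the paper's exact scheme may differ.
    """
    if img1.shape != img2.shape:
        raise ValueError("bi-temporal images must have the same shape")
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    a = img1.astype(np.float32)
    b = img2.astype(np.float32)
    # Blend weights run from 0 (pure img1) to 1 (pure img2),
    # reshaped so they broadcast over every image axis.
    alphas = np.linspace(0.0, 1.0, num_frames, dtype=np.float32)
    alphas = alphas.reshape(-1, *([1] * a.ndim))
    return (1.0 - alphas) * a[None] + alphas * b[None]
```

The first and last frames reproduce the two input images, and the intermediate frames give a temporal encoder a sequence in which changed regions vary over time while unchanged regions stay constant.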

Paper Structure (18 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison between (a) the existing fine-grained encoder-decoder framework for RSCD; (b) the existing motion-augmented fine-grained encoder-decoder framework for RSCD; and (c) our newly proposed coarse-grained temporal foregrounds mining for RSCD. Note that $I_1$ and $I_2$ are the input image pair.
  • Figure 2: Overview of the Coarse-grained Temporal Mining Augmented (CTMA) framework for remote sensing image change detection. It mainly contains two modules, i.e., the Temporal Encoder (TE) and the Coarse-grained Foregrounds Augmented Spatial Encoder (CFA-SE). Given the bi-temporal images, we first utilize TE to extract feature representations containing temporal information and generate a preliminary mask map. Subsequently, we introduce CFA-SE to integrate global and local information of the image pairs, and further optimize the results with a mask augmented strategy. This strategy leverages the initial mask map generated by TE as prior knowledge to guide CFA-SE in producing more accurate detection results. In addition, as a supplement, we add a motion augmented strategy that incorporates motion information within CFA-SE for better overall performance.
  • Figure 3: Qualitative results of the interpolated dense video frames on WHU-CD dataset.
  • Figure 4: Visualization of learned feature maps on the WHU-CD test set. The brighter the color, the greater the response value. 'G.T.' denotes the ground-truth label of the corresponding image.
  • Figure 5: Visualization of the coarse mask and the change detection results acquired by the temporal encoder (TE) and the proposed CTMA method, respectively.
  • ...and 1 more figures