Relating CNN-Transformer Fusion Network for Change Detection

Yuhao Gao, Gensheng Pei, Mengmeng Sheng, Zeren Sun, Tao Chen, Yazhou Yao

TL;DR

Change detection in remote sensing requires both global context and fine-grained detail. The authors propose RCTNet, a CNN-Transformer hybrid that combines an early fusion backbone with three modules: Cross-Stage Aggregation (CSA) to fuse multi-stage features, a Multi-Scale Feature Fusion (MSF) decoder module to capture multi-scale context, and Efficient Self-deciphering Attention (ESA) to inject global semantics at low computational cost. On three datasets (WHU-CD, LEVIR-CD, and SYSU-CD), RCTNet achieves state-of-the-art or competitive accuracy at an efficient computational cost for bitemporal RS change detection.

Abstract

While deep learning, particularly convolutional neural networks (CNNs), has revolutionized remote sensing (RS) change detection (CD), existing approaches often miss crucial features due to neglecting global context and incomplete change learning. Additionally, transformer networks struggle with low-level details. RCTNet addresses these limitations by introducing (1) an early fusion backbone to exploit both spatial and temporal features early on, (2) a Cross-Stage Aggregation (CSA) module for enhanced temporal representation, (3) a Multi-Scale Feature Fusion (MSF) module for enriched feature extraction in the decoder, and (4) an Efficient Self-deciphering Attention (ESA) module utilizing transformers to capture global information and fine-grained details for accurate change detection. Extensive experiments demonstrate RCTNet's clear superiority over traditional RS image CD methods, showing significant improvement and an optimal balance between accuracy and computational cost.

Paper Structure (11 sections, 9 equations, 5 figures, 3 tables)

This paper contains 11 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of challenging scenarios, e.g., weakly discriminative objects and lighting-affected objects. In the first example, the characteristics of the labeled building are similar to those of the ground. The second example shows shadow interference in changed regions.
  • Figure 2: An illustration of our RCTNet. Our proposed architecture comprises three key modules: a shared-weight backbone for temporal feature extraction, a Cross-Stage Aggregation (CSA) module for enhanced representation at each stage, and a U-shape decoder utilizing Multi-Scale Fusion (MSF) and Efficient Self-deciphering Attention (ESA) for robust decoding. RegNet extracts features from a registered image pair, and the CSA module enriches each stage's output. Finally, the U-shape decoder fuses multi-scale features through MSF and leverages ESA for accurate predictions.
  • Figure 3: Illustration of cross-stage aggregation (CSA).
  • Figure 4: Illustration of the two modules in U-shape decoder.
  • Figure 5: Qualitative comparisons of our proposed method and state-of-the-art approaches on two benchmark datasets, LEVIR-CD [chen2020spatial] and WHU-CD [ji2018fully]. Examples (a) and (b) showcase results on LEVIR-CD, while (c) and (d) focus on WHU-CD. In each sample, "GT" represents the ground truth, white areas denote true positives, black areas represent true negatives, red areas indicate false positives, and blue areas represent false negatives.
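
The colour coding described in the Figure 5 caption can be reproduced from any binary prediction and ground-truth mask. The following is a minimal sketch, not the authors' plotting code: the function name `error_map`, the use of NumPy, and the 8-bit RGB values are assumptions made for illustration. Only the mapping itself (white for true positives, black for true negatives, red for false positives, blue for false negatives) comes from the caption.

```python
import numpy as np

# Colour coding taken from the Figure 5 caption (8-bit RGB values are an assumption).
WHITE = (255, 255, 255)  # true positive: predicted change, labelled change
BLACK = (0, 0, 0)        # true negative: predicted no change, labelled no change
RED = (255, 0, 0)        # false positive: predicted change, labelled no change
BLUE = (0, 0, 255)       # false negative: predicted no change, labelled change


def error_map(pred, gt):
    """Return an (H, W, 3) uint8 image colour-coding `pred` against `gt`.

    Hypothetical helper: both inputs are 2-D binary masks of the same shape,
    where non-zero means "changed".
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    # Start from an all-black image, so true negatives need no further work.
    out = np.zeros(pred.shape + (3,), dtype=np.uint8)
    out[pred & gt] = WHITE
    out[pred & ~gt] = RED
    out[~pred & gt] = BLUE
    return out


# Tiny 2x2 example containing one pixel of each outcome.
pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [1, 0]]
vis = error_map(pred, gt)
```

In this toy example the top-left pixel is a true positive, the top-right a false positive, the bottom-left a false negative, and the bottom-right a true negative, so `vis` contains one pixel of each colour.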