Relating CNN-Transformer Fusion Network for Change Detection
Yuhao Gao, Gensheng Pei, Mengmeng Sheng, Zeren Sun, Tao Chen, Yazhou Yao
TL;DR
Change detection in remote sensing requires both global context and fine-grained detail. The authors propose RCTNet, a CNN-Transformer hybrid that combines an early fusion backbone with three modules: Cross-Stage Aggregation (CSA), which fuses multi-stage features; a Multi-Scale Feature Fusion (MSF) decoder, which captures multi-scale context; and Efficient Self-deciphering Attention (ESA), which injects global semantics at low computational cost. On three datasets (WHU-CD, LEVIR-CD, and SYSU-CD), RCTNet achieves state-of-the-art or competitive accuracy at an efficient computational cost for bitemporal remote sensing change detection.
Abstract
While deep learning, particularly convolutional neural networks (CNNs), has revolutionized remote sensing (RS) change detection (CD), existing approaches often miss crucial features because they neglect global context and learn changes incompletely. Transformer networks, in turn, struggle with low-level details. RCTNet addresses these limitations by introducing (1) an early fusion backbone to exploit both spatial and temporal features early on, (2) a Cross-Stage Aggregation (CSA) module for enhanced temporal representation, (3) a Multi-Scale Feature Fusion (MSF) module for enriched feature extraction in the decoder, and (4) an Efficient Self-deciphering Attention (ESA) module that uses transformers to capture global information and fine-grained details for accurate change detection. Extensive experiments demonstrate RCTNet's clear superiority over traditional RS image CD methods, showing significant improvement and an optimal balance between accuracy and computational cost.
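To make the "early fusion" idea concrete, here is a minimal sketch under a stated assumption: that early fusion means combining the two co-registered images into a single input before any feature extraction, here by channel concatenation. The abstract does not specify how RCTNet fuses its inputs, so the function names (`early_fuse`, `make_image`) and the nested-list image layout are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of early fusion for a bitemporal image pair.
# Assumption (not from the paper): fusion is done by channel concatenation.
# Images are channel-first nested lists, indexed as image[channel][row][col].


def early_fuse(img_t1, img_t2):
    """Stack the channels of two co-registered images into one input."""
    h1, w1 = len(img_t1[0]), len(img_t1[0][0])
    h2, w2 = len(img_t2[0]), len(img_t2[0][0])
    if (h1, w1) != (h2, w2):
        raise ValueError("bitemporal images must share the same height and width")
    # List concatenation: channels of t1 first, then channels of t2.
    return img_t1 + img_t2


def make_image(channels, height, width, value):
    """Build a constant-valued toy image for illustration."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


pre_change = make_image(3, 4, 4, 0.2)   # toy RGB image at time 1
post_change = make_image(3, 4, 4, 0.7)  # toy RGB image at time 2

fused = early_fuse(pre_change, post_change)
fused_shape = (len(fused), len(fused[0]), len(fused[0][0]))
```

In this toy setup, two 3-channel images become one 6-channel input, so a single backbone sees both dates from its first layer. A late-fusion (Siamese) design would instead encode each image separately and combine the features afterwards.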
