ChangeViT: Unleashing Plain Vision Transformers for Change Detection

Duowang Zhu, Xiaohu Huang, Haiyan Huang, Zhenfeng Shao, Qimin Cheng

TL;DR

This paper tackles change detection in remote sensing by challenging the dominance of CNN backbones and demonstrating that plain Vision Transformers (ViTs) excel at detecting large-scale changes. The authors introduce ChangeViT, which combines a plain ViT backbone with a lightweight detail-capture module and a cross-attention-based feature injector that fuses fine-grained details into high-level semantic representations. In experiments on LEVIR-CD, WHU-CD, CLCD, and OSCD, ChangeViT achieves state-of-the-art performance across datasets and change scales while remaining more lightweight than many hierarchical architectures. The study highlights the potential of vanilla ViTs for change detection and provides ablations showing the effectiveness of each component and the benefits of pre-training, suggesting broader applicability to other dense prediction tasks.

Abstract

Change detection in remote sensing images is essential for tracking environmental changes on the Earth's surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain underutilized in change detection, where convolutional neural networks (CNNs) continue to dominate due to their powerful feature extraction capabilities. In this paper, our study uncovers ViTs' unique advantage in discerning large-scale changes, a capability where CNNs fall short. Capitalizing on this insight, we introduce ChangeViT, a framework that adopts a plain ViT backbone to enhance the performance of large-scale changes. This framework is supplemented by a detail-capture module that generates detailed spatial features and a feature injector that efficiently integrates fine-grained spatial information into high-level semantic learning. The feature integration ensures that ChangeViT excels in both detecting large-scale changes and capturing fine-grained details, providing comprehensive change detection across diverse scales. Without bells and whistles, ChangeViT achieves state-of-the-art performance on three popular high-resolution datasets (i.e., LEVIR-CD, WHU-CD, and CLCD) and one low-resolution dataset (i.e., OSCD), which underscores the unleashed potential of plain ViTs for change detection. Furthermore, thorough quantitative and qualitative analyses validate the efficacy of the introduced modules, solidifying the effectiveness of our approach. The source code is available at https://github.com/zhuduowang/ChangeViT.

Paper Structure (21 sections, 10 equations, 5 figures, 7 tables)

This paper contains 21 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) Performance comparison of different change detectors across three datasets, categorized as CNN-based and ViT-based models. (b) Performance comparison ($\Delta$IoU (%)) between a CNN (ResNet18) and a ViT (ViT-S (DINOv2)) model for detecting changes of various sizes. The horizontal axis orders the change sizes from smallest to largest. Each $\Delta$IoU value is obtained by subtracting the CNN’s IoU from the ViT’s IoU for that size category.
  • Figure 2: Overview of the proposed ChangeViT. Bi-temporal images $I_{1}$ and $I_{2}$ are first fed into a shared ViT to extract high-level semantic features and into a detail-capture module to extract low-level detailed information. Subsequently, a feature injector injects the low-level details into the high-level features. Finally, a decoder predicts the change probability maps.
  • Figure 3: Illustration of the feature injectors. $F_{C_i}$ ($i\in\{1,2,3\}$) denote the multi-scale detailed features acquired from the detail-capture module, while $F_V$ denotes the ViT's feature, which lacks detailed information. (a) $F_V$ serves as the query, and $F_{C_i}$ as the key and value, to capture detailed features for the ViT. (b) $F_V$ serves as the key and value, and $F_{C_i}$ as the query, to refine features for the ViT.
  • Figure 4: (a) Each dataset is split evenly into five intervals based on change size. The horizontal axis orders the change sizes from smaller to larger. (b) The predicted map within the red box indicates a poor detection outcome.
  • Figure 5: Qualitative comparison of different methods on the three datasets. White represents a true positive, black is a true negative, green indicates a false positive, and red is a false negative. Fewer green and red pixels represent better performance. For better clarity, please zoom in on the figure.
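
The $\Delta$IoU in Figure 1(b) is the ViT's IoU minus the CNN's IoU for each change-size category, expressed in percentage points. The sketch below illustrates that arithmetic on toy binary masks. The function names, the masks, and the convention for an empty union are illustrative assumptions, not taken from the paper or its code.

```python
def iou(pred, target):
    """IoU of the 'changed' class for two flat binary masks (lists of 0/1)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def delta_iou(vit_ious, cnn_ious):
    """Per-category ViT IoU minus CNN IoU, in percentage points."""
    return [100.0 * (v - c) for v, c in zip(vit_ious, cnn_ious)]


# Toy masks for one size category (hypothetical values, not results from the paper).
target = [1, 1, 1, 1, 0, 0]
vit_pred = [1, 1, 1, 0, 0, 0]
cnn_pred = [1, 1, 0, 0, 1, 0]

deltas = delta_iou([iou(vit_pred, target)], [iou(cnn_pred, target)])
```

A positive entry in `deltas` means the ViT outperforms the CNN in that size category, and a negative entry means the reverse.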
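
Variant (a) of the feature injector in Figure 3 uses $F_V$ as the query and $F_{C_i}$ as the key and value. Below is a minimal single-head sketch of that cross-attention pattern in plain Python. It omits learned projections and multi-head splitting, and the residual addition in `inject_details` is an assumption made for illustration, so it should not be read as the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim_out)])
    return out


def inject_details(f_v, f_c):
    """Variant (a): F_V queries F_C. The residual add is an assumption."""
    attended = cross_attention(f_v, f_c, f_c)
    return [[a + b for a, b in zip(tok, det)] for tok, det in zip(f_v, attended)]


# Toy tensors (hypothetical): 2 ViT tokens and 3 detail tokens, both of width 2.
f_v = [[1.0, 0.0], [0.0, 1.0]]
f_c = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = inject_details(f_v, f_c)
```

The output keeps one vector per ViT token, so the fused features retain the shape of $F_V$. Swapping the roles of the two inputs, as in `cross_attention(f_c, f_v, f_v)`, gives the attention pattern of variant (b).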