Single-temporal Supervised Remote Change Detection for Domain Generalization

Qiangang Du; Jinlong Peng; Xu Chen; Qingdong He; Liren He; Qiang Nie; Wenbing Zhu; Mingmin Chi; Yabiao Wang; Chengjie Wang

TL;DR

Remote sensing change detection (RSCD) systems generalize poorly across domains because they rely on dataset-specific bi-temporal labels. This work introduces ChangeCLIP, a multimodal change-detection framework that extends CLIP-style text-vision alignment to dense RSCD through local patch-visual and pixel-context contrastive learning, aided by dynamic text-context optimization (DTCO). To reduce data dependency, it proposes SAIN, a single-temporal, controllable AI-generated training strategy that uses ControlNet to synthesize pseudo image pairs. Experiments on LEVIR-CD and WHU-CD report stronger generalization than state-of-the-art detectors, including in zero-shot settings when the model is trained on single-temporal data only. The authors state that code will be released.

Abstract

Change detection is widely applied in remote sensing image analysis. Existing methods require a separate model to be trained for each dataset, which leads to poor domain generalization. These methods also rely heavily on large amounts of high-quality pair-labelled data for training, which is expensive and impractical to collect. In this paper, we propose ChangeCLIP, a multimodal contrastive learning framework based on visual-language pre-training, for domain generalization in change detection. We additionally propose a dynamic context optimization for prompt learning. To address the data dependency of existing methods, we introduce a single-temporal and controllable AI-generated training strategy (SAIN). It allows us to train the model on a large number of single-temporal images, without real-world image pairs, while achieving excellent generalization. Extensive experiments on a series of real change detection datasets validate the superiority and strong generalization of ChangeCLIP, which outperforms state-of-the-art change detection methods. Code will be available.
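The contrastive text-vision alignment mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D features, and the temperature of 0.07 (a value commonly used for CLIP-style models) are assumptions made for illustration. Each pixel feature is compared with K class text embeddings by cosine similarity, and a cross-entropy over the scaled similarities pulls the pixel towards the embedding of its own class.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pixel_text_loss(pixel_feats, text_embeds, labels, temperature=0.07):
    """Mean cross-entropy over pixel-to-text cosine-similarity logits.

    pixel_feats: one feature vector per pixel (hypothetical dense features)
    text_embeds: one embedding per class prompt (hypothetical text features)
    labels:      class index of each pixel
    temperature: assumed scaling factor, not taken from the paper
    """
    total = 0.0
    for feat, label in zip(pixel_feats, labels):
        logits = [cosine(feat, t) / temperature for t in text_embeds]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[label]
    return total / len(labels)


def predict_classes(pixel_feats, text_embeds):
    """K-way classification: pick the most similar text embedding per pixel."""
    return [
        max(range(len(text_embeds)), key=lambda k: cosine(f, text_embeds[k]))
        for f in pixel_feats
    ]


# Toy example with two classes and two pixels.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [[0.9, 0.1], [0.2, 0.8]]
print(predict_classes(pixel_feats, text_embeds))  # -> [0, 1]
```

In this toy setup the loss is lower when the labels match the nearest text embedding than when they are swapped, which is the behaviour a contrastive alignment objective is meant to reward.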

Paper Structure (19 sections, 12 equations, 5 figures, 4 tables)

This paper contains 19 sections, 12 equations, 5 figures, 4 tables.

Introduction
Related Work
Supervised Change Detection
Unsupervised Change Detection
Single-Temporal Supervised Change Detection
Vision-Language Models
Method
Preliminary: CLIP and Prompt Engineering
Text-Vision Alignment
Dynamic Text-Vision Context Optimization
Single-Temporal Controllable AI-Generated Training Strategy
Experiment
Experimental Setup
Dataset
Metrics and Implementation Details
...and 4 more sections

Figures (5)

Figure 1: (a) Self-supervised paradigm [Chen_2022_selfsup, zhang2023diffucdunsupervised, noh2022unsupervised, Ren_2021_GAN], trained by contrastive learning or by making pseudo image pairs through simple image augmentation. (b) Pretraining paradigm [Zheng_2021_ICCV_changestar, Sun_2022_TGARS, saha_2019_adapt, saha_2019_unsup], which trains an encoder on single-temporal datasets and predicts changes by feature analysis. (c) ChangeCLIP: multimodal contrastive learning based on single-temporal data for domain-adaptation RSCD.
Figure 2: Overview of ChangeCLIP, which consists of a visual processor and a text processor. ChangeCLIP first extracts the regional visual-patch features $\bar{f}$ and obtains the dense visual features $\hat{f}$ through an FPN; a K-way classification then performs semantic segmentation. The local visual-patch alignment loss $\mathcal{L}_{lva}$ compensates for the large-scale pre-trained model's lack of region-level feature learning. The text encoder then extracts text embeddings, and CLIP is fine-tuned for generalizable change detection through the visual-context alignment loss $\mathcal{L}_{pca}$.
Figure 3: Framework of the single-temporal training strategy (SAIN). We separate each class of objects and generate shape masks by mask generation. During training, we train the ControlNet by inpainting learning. During inference, we introduce deliberate category-based changes to each object in the image except buildings.
Figure 4: Visualization analysis of ChangeCLIP against benchmark methods. Red regions indicate false positives (FP) and blue regions indicate false negatives (FN). The columns show (a) the past-temporal image, (b) the post-temporal image, (c) the ground truth, (d) Changer [arxiv:changer], (e) SARAS [chen2023sarasnet], (f) Tiny-CD [codegoni2022tinycd], (g) USSFC [ussfc], (h) ChangeStar [Zheng_2021_ICCV_changestar], and (i) ChangeCLIP.
Figure 5: Qualitative results for single-temporal images from the Single-Temporal Controllable AI-Generated Training Strategy section. The red area marks where buildings overlap in STAR [Zheng_2021_ICCV_changestar], which is impossible in the real world. SAIN generates images that include both pseudo-changes within the same category and changes between different categories.
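The distinction drawn in the Figure 5 caption, between pseudo-change within one category and change between categories, suggests how a change label could be derived for a synthetic image pair. The sketch below is an assumption, not the authors' procedure: it presumes per-pixel class maps are available for both the original image and the generated one, uses hypothetical class ids, and does not model the ControlNet generator. A pixel is marked as changed only where its class differs, so an appearance change that stays within the same class remains unchanged.

```python
def change_label(pre_classes, post_classes):
    """Binary change map from two per-pixel class maps of the same size.

    A pixel is 1 (changed) only when its class id differs between the
    original map and the map of the generated image; otherwise it is 0.
    """
    if len(pre_classes) != len(post_classes) or any(
        len(a) != len(b) for a, b in zip(pre_classes, post_classes)
    ):
        raise ValueError("class maps must have the same shape")
    return [
        [int(a != b) for a, b in zip(row_pre, row_post)]
        for row_pre, row_post in zip(pre_classes, post_classes)
    ]


# Hypothetical class ids for illustration: 0 = background, 1 = building,
# 2 = vegetation, 3 = road.
pre = [[0, 1, 1],
       [0, 2, 2]]
post = [[0, 1, 3],
        [0, 2, 0]]
print(change_label(pre, post))  # -> [[0, 0, 1], [0, 0, 1]]
```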
