Changen2: Multi-Temporal Remote Sensing Generative Change Foundation Model

Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, Yanfei Zhong

TL;DR

This paper presents Changen2, a generative probabilistic change model (GPCM) implemented with a resolution-scalable diffusion transformer. It generates time series of remote sensing images, together with the corresponding semantic and change labels, from labeled or even unlabeled single-temporal images. As a “generative change foundation model”, it can be trained at scale via self-supervision and can produce change supervisory signals from unlabeled single-temporal images.

Abstract

Our understanding of the temporal dynamics of the Earth's surface has been advanced by deep vision models, which often require large numbers of labeled multi-temporal images for training. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present change data generators based on generative models, which are cheap and automatic, alleviating these data problems. Our main idea is to simulate a stochastic change process over time. We describe the stochastic change process as a probabilistic graphical model, the generative probabilistic change model (GPCM), which factorizes the complex simulation problem into two more tractable sub-problems, i.e., change event simulation and semantic change synthesis. To solve these two problems, we present Changen2, a GPCM with a resolution-scalable diffusion transformer which can generate time series of images and their semantic and change labels from labeled or unlabeled single-temporal images. Changen2 is a generative change foundation model that can be trained at scale via self-supervision, and can produce change supervisory signals from unlabeled single-temporal images. Unlike existing foundation models, Changen2 synthesizes change data to train task-specific foundation models for change detection. The resulting model possesses inherent zero-shot change detection capabilities and excellent transferability. Experiments suggest Changen2 has superior spatiotemporal scalability, e.g., a Changen2 model trained on 256$^2$ pixel single-temporal images can yield time series of any length at resolutions of 1,024$^2$ pixels. Changen2 pre-trained models exhibit superior zero-shot performance (narrowing the performance gap to 3% on LEVIR-CD and approximately 10% on both S2Looking and SECOND, compared to fully supervised counterparts) and transferability across multiple types of change tasks.
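
The two sub-problems named in the abstract can be read as the two factors of a one-step conditional distribution. The following is only a sketch of that reading, not the paper's own equation: the symbol $\mathbf{S}_t$ for the semantic mask at time $t$ is our assumption (the image symbol $\mathbf{I}_t$ follows the figure captions), and we assume that the new mask depends only on the previous mask and that the new image depends only on the previous image and the new mask.

$$
P(\mathbf{I}_{t+1}, \mathbf{S}_{t+1} \mid \mathbf{I}_t, \mathbf{S}_t)
= \underbrace{P(\mathbf{S}_{t+1} \mid \mathbf{S}_t)}_{\text{change event simulation}}
\; \underbrace{P(\mathbf{I}_{t+1} \mid \mathbf{I}_t, \mathbf{S}_{t+1})}_{\text{semantic change synthesis}}
$$

Under these assumptions, a change label for the pair $(t, t+1)$ would follow directly from comparing $\mathbf{S}_t$ with $\mathbf{S}_{t+1}$, and applying the same step repeatedly would extend the series to later times.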

Paper Structure (16 sections, 7 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: Generative Probabilistic Change Model (GPCM). The bottom subfigure shows the case in which a semantic mask serves as the condition. $\phi$ denotes an editable condition extractor that provides self-supervision.
  • Figure 2: Attribute Edit: Customized Semantic Transition Matrix. We demonstrate a uniform sampling case for the category system of the OpenEarthMap dataset. Based on this semantic transition matrix, we can inject the change class prior into synthetic change data, thereby producing the dataset that the application scenario requires.
  • Figure 3: Our Changen2 framework. The change event simulation supports adding objects, removing objects, and editing object attributes in the semantic mask at time $t$ to customize new semantic masks at times $t+1~{\rm to}~n$. For the semantic change synthesis, the new images at times $t+1~{\rm to}~n$ are synthesized by iterative conditional denoising on the image at time $t$. Changen2 can generate a multi-temporal dataset with controllable scene layout, object properties (e.g., scale, position, orientation, semantics, see $\mathbf{I}_{t+n}$), and change events. Legend: Create; Remove.
  • Figure 4: Network architecture of RS-DiT. Based on the DiT architecture, we make two small but important improvements: (i) remove the absolute position embedding and insert a 3$\times$3 depthwise convolution in the FFN; (ii) replace global self-attention with local window attention. In addition, we introduce a dense embedding network to encode dense conditions, thereby enabling tasks that require dense conditional image generation. Our improvements are highlighted in color.
  • Figure 5: Analysis and Ablation: Spatial Resolution Scalability (256$^2$ px to 512$^2$ px). All generated 512$\times$512 images are synthesized post-event images without pre-event image guidance. The top left three results are generated from a DiT model trained at 256$\times$256.
  • ...and 9 more figures
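
The mask-editing ideas in the captions for Figures 2 and 3 (a uniform semantic transition matrix, attribute edits, and object removal) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the instance-id map `object_ids`, and the `background` class index are all hypothetical, and plain Python lists stand in for real raster masks.

```python
import random


def uniform_transition_matrix(num_classes):
    """Row i holds P(new class = j | old class = i), uniform over all j != i."""
    p = 1.0 / (num_classes - 1)
    return [[0.0 if i == j else p for j in range(num_classes)]
            for i in range(num_classes)]


def edit_attributes(mask, object_ids, selected, transition, rng):
    """Re-label every selected object with a class sampled from its row.

    mask       : 2D list of class indices at time t
    object_ids : 2D list of instance ids, same shape (hypothetical input)
    selected   : set of instance ids whose class should change
    """
    new_class = {}  # one sampled class per object, shared by all its pixels
    new_mask = [row[:] for row in mask]
    classes = range(len(transition))
    for r, row in enumerate(mask):
        for c, old in enumerate(row):
            oid = object_ids[r][c]
            if oid not in selected:
                continue
            if oid not in new_class:
                new_class[oid] = rng.choices(classes, weights=transition[old])[0]
            new_mask[r][c] = new_class[oid]
    return new_mask


def remove_objects(mask, object_ids, selected, background=0):
    """Overwrite the pixels of the selected objects with a background class."""
    return [[background if object_ids[r][c] in selected else cls
             for c, cls in enumerate(row)]
            for r, row in enumerate(mask)]


def change_mask(mask_t, mask_t1):
    """Binary change label: 1 where the class differs between the two times."""
    return [[int(a != b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_t, mask_t1)]


# Toy 4x4 scene: three objects with classes 0, 1 and 2.
rng = random.Random(0)
transition = uniform_transition_matrix(3)
mask_t = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [2, 2, 2, 2],
          [2, 2, 2, 2]]
object_ids = [[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 3, 3],
              [3, 3, 3, 3]]

mask_t1 = edit_attributes(mask_t, object_ids, {2}, transition, rng)
change = change_mask(mask_t, mask_t1)
```

Because the diagonal of the matrix is zero, every selected object is guaranteed to change class, so the derived change label covers exactly the edited objects. A non-uniform matrix would be one way to encode a change class prior of the kind the Figure 2 caption mentions.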