CRS-Diff: Controllable Remote Sensing Image Generation with Diffusion Model

Datao Tang, Xiangyong Cao, Xingsong Hou, Zhongyuan Jiang, Junmin Liu, Deyu Meng

TL;DR

CRS-Diff is the first multiple-condition controllable RS generative model that can simultaneously support text-condition, metadata-condition, and image-condition control inputs, thus enabling more precise control to refine the generation process.

Abstract

The emergence of generative models has revolutionized the field of remote sensing (RS) image generation. Despite generating high-quality images, existing methods are limited in relying mainly on text control conditions, and thus do not always generate images accurately and stably. In this paper, we propose CRS-Diff, a new RS generative framework specifically tailored for RS image generation, leveraging the inherent advantages of diffusion models while integrating more advanced control mechanisms. Specifically, CRS-Diff can simultaneously support text-condition, metadata-condition, and image-condition control inputs, thus enabling more precise control to refine the generation process. To effectively integrate multiple condition control information, we introduce a new conditional control mechanism to achieve multi-scale feature fusion, thus enhancing the guiding effect of control conditions. To our knowledge, CRS-Diff is the first multiple-condition controllable RS generative model. Experimental results in single-condition and multiple-condition cases have demonstrated the superior ability of our CRS-Diff to generate RS images both quantitatively and qualitatively compared with previous methods. Additionally, our CRS-Diff can serve as a data engine that generates high-quality training data for downstream tasks, e.g., road extraction. The code is available at https://github.com/Sonettoo/CRS-Diff.

Paper Structure (20 sections, 10 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: (a) Comparison between natural image and remote sensing (RS) image. The image content is the Capital Museum of China, sourced from Google Maps and Google Street View, respectively. As can be seen, RS imagery differs significantly from traditional RGB imagery in resolution, coverage area, and information richness. (b) Comparison of the generation results between the two control modes. The upper image is the generation result guided solely by text, while the lower image is the result guided by both text and sketch. As can be seen, the single text control condition fails to generate accurate image content while "text + sketch" conditions can succeed.
  • Figure 2: Visualisation results of our proposed CRS-Diff. (a) Single text condition generation: the RS images are generated based only on text. (b) Single image condition generation: the RS images are generated based on the image condition. (c) Multi-condition image generation: the RS images are generated under the control of multiple conditions.
  • Figure 3: The overall architecture of our proposed CRS-Diff model. CRS-Diff is mainly based on Stable Diffusion (SD), so the diffusion process is performed in latent space. The training of CRS-Diff contains two stages. In the first stage, we train the backbone U-Net of SD on text-image pairs. The resulting diffusion network is frozen (blue area) during the second stage, and its encoder and intermediate blocks are copied into ControlNet to adapt to conditional inputs. In the second stage, we stack the conditional images as inputs and extract conditional features using a feature extractor. These features are gradually injected into the encoder of ControlNet (orange area) through a Feature Fusion (FF) module. Here, we use a convolutional network to reshape the obtained feature vectors to the current noise dimension and then integrate them with the noise output of the current block of the ControlNet encoder through Attention Feature Fusion (AFF), achieving multi-scale conditional injection.
  • Figure 4: The raw text is encoded using a CLIP text encoder fine-tuned on RS images. The content image is first encoded using the CLIP image encoder, and the resulting encoding is then converted into four additional text tokens by a Feed Forward Network (FFN). The metadata is first mapped into fixed intervals and then converted into the same number of tokens by an embedding layer. Finally, these processed encodings are concatenated, replacing the original text-encoded input.
  • Figure 5: Visual comparison of different text-to-image generation methods based on text descriptions on the RSICD test set.
  • ...and 4 more figures
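
The token assembly described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The token width, the number of bins, the metadata field names (`latitude`, `cloud_cover`), their value ranges, and the single linear map standing in for the FFN are all assumptions made for the example. The sketch also reads "the same number of tokens" as one token per metadata field, which is only one possible interpretation of the caption. Random matrices stand in for the CLIP encoder outputs and the learned weights.

```python
import random

EMBED_DIM = 8         # hypothetical token width (real CLIP tokens are wider)
NUM_BINS = 10         # hypothetical number of fixed intervals per metadata field
NUM_IMAGE_TOKENS = 4  # the caption states four additional tokens

rng = random.Random(0)


def random_matrix(rows, cols):
    """Stand-in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def quantize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous metadata value to the index of a fixed interval."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * num_bins), num_bins - 1)


def linear(vec, weight):
    """Multiply a vector of length `in` by an (in x out) weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def image_to_tokens(image_embedding, weight):
    """Project one image embedding to four tokens (stand-in for the FFN)."""
    flat = linear(image_embedding, weight)
    return [flat[i * EMBED_DIM:(i + 1) * EMBED_DIM]
            for i in range(NUM_IMAGE_TOKENS)]


def metadata_to_tokens(metadata, ranges, tables):
    """Look up one embedding row per metadata field after quantization."""
    return [tables[name][quantize(metadata[name], *ranges[name])]
            for name in metadata]


def build_condition(text_tokens, image_embedding, metadata,
                    ranges, tables, ffn_weight):
    """Concatenate text, image, and metadata tokens along the sequence axis."""
    return (text_tokens
            + image_to_tokens(image_embedding, ffn_weight)
            + metadata_to_tokens(metadata, ranges, tables))


# Hypothetical metadata fields and value ranges, chosen only for illustration.
ranges = {"latitude": (-90.0, 90.0), "cloud_cover": (0.0, 1.0)}
tables = {name: random_matrix(NUM_BINS, EMBED_DIM) for name in ranges}
ffn_weight = random_matrix(EMBED_DIM, NUM_IMAGE_TOKENS * EMBED_DIM)

text_tokens = random_matrix(5, EMBED_DIM)  # stand-in for CLIP text tokens
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
metadata = {"latitude": 39.9, "cloud_cover": 0.25}

condition = build_condition(text_tokens, image_embedding, metadata,
                            ranges, tables, ffn_weight)
print(len(condition), len(condition[0]))
```

With five text tokens, four image tokens, and two metadata tokens, the resulting sequence holds eleven tokens of equal width. That combined sequence is what would take the place of the text-only encoding as the conditioning input.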