RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model

Ahmad Sebaq, Mohamed ElHelw

TL;DR

RSDiff tackles text-driven remote sensing image synthesis with a cascaded diffusion framework: a low-resolution diffusion model (LRDM) first generates a 128×128 image from text, then a text-conditioned super-resolution diffusion model (SRDM) upscales to 256×256. It leverages a frozen T5 text encoder and classifier-free guidance to control conditioning, achieving strong spatial fidelity with a compact ~0.75B parameter footprint. Evaluated on RSICD, RSDiff delivers state-of-the-art FID while maintaining competitive Inception Scores, demonstrating superior semantic consistency in complex scenes compared to several GAN/transformer baselines. The approach offers efficient data augmentation potential for remote sensing and suggests future work on exploiting unlabeled imagery to further improve text-to-image synthesis.
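The classifier-free guidance step mentioned above can be illustrated with the standard combination rule, in which the unconditional noise prediction is pushed toward the text-conditioned one. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `cfg_combine`, the flat-list representation of the noise predictions, and the guidance weights used below are all hypothetical, and the summary does not state which weight RSDiff uses.

```python
from typing import List, Sequence


def cfg_combine(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    guidance_weight: float,
) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Standard classifier-free guidance rule:
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    Names and the flat-list layout are illustrative assumptions.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy predictions standing in for the denoiser outputs with and without text.
eps_text = [1.0, 2.0]
eps_null = [0.0, 1.0]

unguided = cfg_combine(eps_text, eps_null, 0.0)  # falls back to the unconditional prediction
plain = cfg_combine(eps_text, eps_null, 1.0)     # recovers the conditional prediction
boosted = cfg_combine(eps_text, eps_null, 2.0)   # extrapolates past the conditional prediction
```

A weight of 0 ignores the text, a weight of 1 reproduces the conditional prediction, and larger weights trade sample diversity for closer adherence to the prompt.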

Abstract

The generation and enhancement of satellite imagery are critical in remote sensing, requiring high-quality, detailed images for accurate analysis. This research introduces a two-stage diffusion model methodology for synthesizing high-resolution satellite images from textual prompts. The pipeline comprises a Low-Resolution Diffusion Model (LRDM) that generates initial images based on text inputs and a Super-Resolution Diffusion Model (SRDM) that refines these images into high-resolution outputs. The LRDM merges text and image embeddings within a shared latent space, capturing essential scene content and structure. The SRDM then enhances these images, focusing on spatial features and visual clarity. Experiments conducted using the Remote Sensing Image Captioning Dataset (RSICD) demonstrate that our method outperforms existing models, producing satellite images with accurate geographical details and improved spatial resolution.
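The data flow of the two-stage pipeline can be sketched as follows. Every function here is a hypothetical placeholder that only reproduces the tensor shapes described in the text (128×128, then 256×256): `encode_text` stands in for the frozen T5 encoder, `lrdm_sample` for the low-resolution diffusion sampler, and `srdm_sample` for the super-resolution sampler, which is replaced by nearest-neighbour upsampling purely so the example runs. None of these names or behaviours come from the paper.

```python
import random
from typing import List, Tuple

Image = List[List[float]]  # single-channel image as nested lists (assumption)


def encode_text(prompt: str, dim: int = 8) -> List[float]:
    """Placeholder for a frozen text encoder: deterministic pseudo-embedding."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def lrdm_sample(text_emb: List[float], size: int = 128) -> Image:
    """Placeholder for the low-resolution sampler: returns a size x size image."""
    rng = random.Random(sum(text_emb))
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def upsample_nearest(img: Image, factor: int = 2) -> Image:
    """Repeat every pixel `factor` times along both axes."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def srdm_sample(low_res: Image, text_emb: List[float], factor: int = 2) -> Image:
    """Placeholder for the super-resolution sampler.

    A real model would denoise while conditioned on both `low_res` and
    `text_emb`; here the text embedding is unused and only the output
    shape is reproduced.
    """
    return upsample_nearest(low_res, factor)


def generate(prompt: str) -> Tuple[Image, Image]:
    text_emb = encode_text(prompt)            # text -> embedding
    low_res = lrdm_sample(text_emb)           # embedding -> 128x128 image
    high_res = srdm_sample(low_res, text_emb)  # 128x128 + text -> 256x256 image
    return low_res, high_res


low, high = generate("a river running through farmland")
print(len(low), len(low[0]), len(high), len(high[0]))
```

The point of the sketch is the interface: the same text embedding feeds both stages, and the second stage also receives the first stage's output.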

Paper Structure

This paper contains 15 sections, 4 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The RSDiff framework uses the T5 text encoder to produce text embeddings from the input text. A conditional diffusion model converts these embeddings into a 128×128 image. For upsampling, RSDiff uses a text-conditioned super-resolution diffusion model that raises the image resolution to 256×256 pixels.