RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model

Ahmad Sebaq; Mohamed ElHelw

TL;DR

RSDiff tackles text-driven remote sensing image synthesis with a cascaded diffusion framework: a low-resolution diffusion model (LRDM) first generates a 128×128 image from text, then a text-conditioned super-resolution diffusion model (SRDM) upscales to 256×256. It leverages a frozen T5 text encoder and classifier-free guidance to control conditioning, achieving strong spatial fidelity with a compact ~0.75B parameter footprint. Evaluated on RSICD, RSDiff delivers state-of-the-art FID while maintaining competitive Inception Scores, demonstrating superior semantic consistency in complex scenes compared to several GAN/transformer baselines. The approach offers efficient data augmentation potential for remote sensing and suggests future work on exploiting unlabeled imagery to further improve text-to-image synthesis.
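The summary above mentions classifier-free guidance without giving its form. The following is a minimal sketch of the commonly used combination rule, in which the guided noise estimate is the unconditional prediction pushed toward the conditional one by a guidance weight. The function name `guided_noise`, the weight `w`, and the flat-list representation are assumptions made for illustration; the summary does not state which variant or weight RSDiff uses.

```python
from typing import List


def guided_noise(eps_cond: List[float], eps_uncond: List[float], w: float) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Assumed form: eps = eps_uncond + w * (eps_cond - eps_uncond).
    w = 0 ignores the text, w = 1 is plain conditional sampling,
    and w > 1 strengthens the influence of the text condition.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy two-element "noise predictions" (hypothetical values).
example = guided_noise([1.0, 2.0], [0.0, 1.0], 2.0)  # [2.0, 3.0]
```

In a real sampler the two predictions would come from the same network, evaluated once with the text embedding and once with it dropped; here they are plain lists so the arithmetic is easy to check.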

Abstract

The generation and enhancement of satellite imagery are critical in remote sensing, requiring high-quality, detailed images for accurate analysis. This research introduces a two-stage diffusion model methodology for synthesizing high-resolution satellite images from textual prompts. The pipeline comprises a Low-Resolution Diffusion Model (LRDM) that generates initial images based on text inputs and a Super-Resolution Diffusion Model (SRDM) that refines these images into high-resolution outputs. The LRDM merges text and image embeddings within a shared latent space, capturing essential scene content and structure. The SRDM then enhances these images, focusing on spatial features and visual clarity. Experiments conducted using the Remote Sensing Image Captioning Dataset (RSICD) demonstrate that our method outperforms existing models, producing satellite images with accurate geographical details and improved spatial resolution.

Paper Structure (15 sections, 4 equations, 1 figure, 2 tables)

Introduction
Related Work
Generative Adversarial Networks
Diffusion Probabilistic Models
Methodology
Pretrained text encoders
Diffusion models and classifier-free guidance
Cascaded diffusion models
Neural network architecture
Experiments
Dataset
Evaluation metrics
Training
Results
Conclusion

Figures (1)

Figure 1: The RSDiff framework uses the T5 text encoder to produce text embeddings from the input text. A conditional diffusion model converts the text embedding into a 128×128 image. For upsampling, RSDiff uses a text-conditional super-resolution diffusion model, which raises the image resolution to 256×256 pixels.
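The data flow in Figure 1 can be sketched as three chained stages. The code below only tracks how the text embedding and image sizes pass from one stage to the next; every function is a hypothetical stub, not the authors' implementation. In particular, `encode_text` returns a dummy vector instead of T5 embeddings, `lrdm_sample` returns a blank grid, and `srdm_sample` uses nearest-neighbour upsampling as a stand-in for the text-conditioned diffusion sampler.

```python
from typing import List

Image = List[List[float]]  # single-channel grid, enough to track spatial size

LOW_RES = 128   # side length of the low-resolution output (from the caption)
HIGH_RES = 256  # side length of the super-resolved output (from the caption)


def encode_text(prompt: str) -> List[float]:
    """Stub for the frozen text encoder: returns a dummy embedding (assumption)."""
    return [float(len(token)) for token in prompt.split()]


def lrdm_sample(text_emb: List[float]) -> Image:
    """Stub for the low-resolution sampler: returns a blank 128x128 grid."""
    return [[0.0] * LOW_RES for _ in range(LOW_RES)]


def srdm_sample(low_res: Image, text_emb: List[float]) -> Image:
    """Stub for the super-resolution sampler.

    Nearest-neighbour upsampling stands in for the real diffusion process,
    which would be conditioned on both the low-resolution image and the text.
    """
    scale = HIGH_RES // len(low_res)
    return [
        [low_res[y // scale][x // scale] for x in range(HIGH_RES)]
        for y in range(HIGH_RES)
    ]


def generate(prompt: str) -> Image:
    emb = encode_text(prompt)      # text -> embedding
    low = lrdm_sample(emb)         # embedding -> 128x128 image
    return srdm_sample(low, emb)   # 128x128 image + embedding -> 256x256 image


image = generate("a river next to a bridge")
```

The point of the sketch is that the same text embedding is passed to both stages, and that the second stage receives the first stage's output as an additional input.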
