RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model
Ahmad Sebaq, Mohamed ElHelw
TL;DR
RSDiff tackles text-driven remote sensing image synthesis with a cascaded diffusion framework: a low-resolution diffusion model (LRDM) first generates a 128×128 image from text, then a text-conditioned super-resolution diffusion model (SRDM) upscales it to 256×256. It uses a frozen T5 text encoder and classifier-free guidance to control how strongly the text conditions generation, achieving strong spatial fidelity with a compact ~0.75B-parameter footprint. Evaluated on RSICD, RSDiff reports state-of-the-art FID while maintaining competitive Inception Scores, and shows better semantic consistency in complex scenes than several GAN and transformer baselines. The approach offers potential for efficient data augmentation in remote sensing and suggests future work on exploiting unlabeled imagery to further improve text-to-image synthesis.
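For readers unfamiliar with classifier-free guidance, the following minimal sketch shows the commonly used blending rule, in which a guidance scale extrapolates from the unconditional noise prediction toward the text-conditioned one. The function name, the toy values, and the scale of 2.0 are illustrative assumptions; this summary does not state which parameterization or guidance scale RSDiff uses.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions.

    Common form (an assumption here, not confirmed for RSDiff):
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    w = 0 ignores the text, w = 1 gives the plain conditional prediction,
    and w > 1 pushes the sample further toward the text condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy, made-up noise predictions for a 4-value "image".
eps_uncond = [0.0, 1.0, -1.0, 0.5]
eps_cond = [1.0, 1.0, 0.0, -0.5]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)  # [2.0, 1.0, 1.0, -1.5]
```

In a real sampler this blend would be applied to the denoiser output at every step, with the unconditional prediction obtained by feeding an empty or dropped text embedding.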
Abstract
The generation and enhancement of satellite imagery are critical in remote sensing, requiring high-quality, detailed images for accurate analysis. This research introduces a two-stage diffusion model methodology for synthesizing high-resolution satellite images from textual prompts. The pipeline comprises a Low-Resolution Diffusion Model (LRDM) that generates initial images based on text inputs and a Super-Resolution Diffusion Model (SRDM) that refines these images into high-resolution outputs. The LRDM merges text and image embeddings within a shared latent space, capturing essential scene content and structure. The SRDM then enhances these images, focusing on spatial features and visual clarity. Experiments conducted using the Remote Sensing Image Captioning Dataset (RSICD) demonstrate that our method outperforms existing models, producing satellite images with accurate geographical details and improved spatial resolution.
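The data flow of the two-stage pipeline can be sketched as below. This is only a structural illustration under loud assumptions: `encode_text`, `lrdm_sample`, and `srdm_sample` are hypothetical placeholders (random noise, a toy smoothing loop, and nearest-neighbour upsampling), not the authors' networks. Only the 128×128 and 256×256 sizes come from the summary above.

```python
import random

LOW_RES = 128   # LRDM output size stated in the summary
HIGH_RES = 256  # SRDM output size stated in the summary


def encode_text(prompt):
    """Hypothetical stand-in for the frozen text encoder (fixed-length vector)."""
    rng = random.Random(prompt)  # seeded by the prompt, so deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(8)]


def lrdm_sample(text_emb, size=LOW_RES, steps=4):
    """Hypothetical stand-in for the LRDM: noise refined by toy 'denoising' steps."""
    rng = random.Random(0)
    bias = sum(text_emb) / len(text_emb)  # crude text conditioning signal
    img = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    for _ in range(steps):
        img = [[0.5 * px + 0.5 * bias for px in row] for row in img]
    return img


def srdm_sample(low_res, text_emb, scale=HIGH_RES // LOW_RES):
    """Hypothetical stand-in for the SRDM: nearest-neighbour upsampling.

    A real SRDM would denoise at high resolution, conditioned on both the
    low-resolution image and the text; text_emb is unused in this toy.
    """
    up = []
    for row in low_res:
        wide = [px for px in row for _ in range(scale)]
        up.extend(list(wide) for _ in range(scale))
    return up


def generate(prompt):
    """Run the cascade: text -> low-resolution image -> high-resolution image."""
    emb = encode_text(prompt)
    low = lrdm_sample(emb)
    high = srdm_sample(low, emb)
    return low, high


low, high = generate("an airport with several planes parked near the terminal")
print(len(low), len(low[0]), len(high), len(high[0]))  # 128 128 256 256
```

The point of the sketch is the interface: the second stage receives the first stage's output together with the same text embedding, so each stage can be trained and swapped independently.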
