
DiffusionSat: A Generative Foundation Model for Satellite Imagery

Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lobell, Stefano Ermon

TL;DR

DiffusionSat addresses the absence of generative foundation models tailored to satellite imagery with a latent-diffusion framework conditioned on text captions and numeric metadata: geolocation, timestamp, and ground sampling distance (GSD). A novel 3D ControlNet conditioning module extends the model to inverse problems such as multi-spectral super-resolution, temporal generation, and inpainting. The model is trained on large public remote sensing (RS) datasets. It reports state-of-the-art performance on single-image generation and on conditional tasks (fMoW super-resolution, Texas Housing super-resolution, fMoW temporal generation, xBD inpainting), outperforming baseline diffusion methods. Realistic, metadata-guided synthesis across space and time makes generative RS data more useful for disaster response, environmental monitoring, and agricultural analysis.

Abstract

Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral and irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, super-resolution given multi-spectral inputs, and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery. The project website can be found here: https://samar-khanna.github.io/DiffusionSat/
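To make the idea of conditioning on numeric metadata more concrete, the following is a minimal sketch of how the fields named in the TL;DR (geolocation, timestamp, GSD) could be turned into fixed-length feature vectors. The sinusoidal encoding, the normalization ranges, the `scale` factor, and all function names are assumptions made for illustration. This summary does not specify how DiffusionSat encodes its metadata, and a real model would typically follow such features with a learned projection.

```python
import math

def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Encode one scalar as `dim` features: dim/2 sines followed by dim/2 cosines."""
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def normalize_metadata(lon, lat, gsd, year, month, day):
    """Map raw metadata to roughly [0, 1]. All ranges here are hypothetical."""
    return {
        "lon": (lon + 180.0) / 360.0,   # degrees in [-180, 180]
        "lat": (lat + 90.0) / 180.0,    # degrees in [-90, 90]
        "gsd": gsd / 10.0,              # assumed: metres per pixel, 10 m reference
        "year": (year - 1980) / 100.0,  # assumed reference year
        "month": month / 12.0,
        "day": day / 31.0,
    }

def embed_metadata(meta, dim=8, scale=1000.0):
    """Embed every normalized field separately; returns {field: feature list}."""
    return {name: sinusoidal_embedding(v * scale, dim) for name, v in meta.items()}

# Example: one hypothetical image record.
meta = normalize_metadata(lon=2.35, lat=48.85, gsd=0.5, year=2017, month=8, day=14)
features = embed_metadata(meta)
```

Keeping each field as its own vector, rather than writing the numbers into the caption, is one way to let the model vary location, date, or GSD independently. Figure 3 reports that naively placing metadata in the text caption gives poorer conditioning flexibility.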

Paper Structure

This paper contains 33 sections, 3 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: Conditioning on freely available metadata and using large, publicly available satellite imagery datasets shows DiffusionSat is a powerful generative foundation model for remote sensing data.
  • Figure 2: DiffusionSat flexibly extends to a variety of conditional generation tasks. We design a 3D version of a ControlNet which can accept a sequence of images. Like regular ControlNets, our 3D ControlNet keeps a trainable copy of the Stable Diffusion (SD) weights for the downsampling and middle blocks. Latent image features are reshaped to combine the batch and temporal dimensions before being input to these layers. The output of each SD block is then passed through a temporal layer (top right), which re-expands the temporal dimension before passing the latent features through a 3D convolution (initialized with zeros) and a temporal, pixel-wise transformer. The metadata associated with each input image is projected as in \ref{fig:main}.
  • Figure 3: Here we generate samples from single-image DiffusionSat. We see that changing the coordinates from a location in Paris to one in the USA changes the type of stadium generated, with American football and baseball more likely to appear in the latter location. Additionally, for locations that receive snow, DiffusionSat accurately captures the correlation between location and season. However, naively incorporating the metadata into the text caption results in poorer conditioning flexibility across geography and season (e.g., wintertime and summertime images produced for both August and January, or a lack of "zooming in" when lowering the GSD).
  • Figure 4: Generated samples from fMoW-Sentinel superresolution validation set. The conditioning image is the Sentinel-2 multispectral (MS) image represented here as SWIR, NIR, RGB. The desired output is the high-resolution (HR) fMoW-RGB image. Our method is able to capture fine-grained details better than other baselines, even when the low-resolution MS image lacks detail. SD tends to "hallucinate" details.
  • Figure 5: Generated samples from the fMoW-temporal validation set, for temporal prediction. The 4 columns in the center are ground-truth images from the temporal sequence. To the right, we see generated samples for the future-prediction task. The goal is to generate the image marked by the date in red, given the 3 other images (to its left) as conditioning signals. Similarly, for the past-prediction task on the left, the goal is to predict the image marked by the date in blue given the 3 images to its right. DiffusionSat leverages pretrained weights to capture seasonal changes and predict human development better than the baselines. Images are best viewed zoomed in.
  • ...and 6 more figures
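
The tensor bookkeeping described in the Figure 2 caption can be sketched as follows: fold the temporal dimension into the batch so per-frame 2D blocks can run, unfold it again, then apply a zero-initialized temporal layer. This is a simplified illustration, not the authors' implementation. The function names are hypothetical, `spatial_block` is a stand-in for a pretrained SD block, a depthwise 1D convolution over time replaces the 3D convolution, and the temporal transformer is omitted.

```python
import numpy as np

def fold_time(x):
    """(B, T, C, H, W) -> (B*T, C, H, W): each frame is treated as a batch item."""
    b, t, c, h, w = x.shape
    return x.reshape(b * t, c, h, w)

def unfold_time(x, batch):
    """(B*T, C, H, W) -> (B, T, C, H, W): re-expand the temporal dimension."""
    bt, c, h, w = x.shape
    return x.reshape(batch, bt // batch, c, h, w)

def temporal_conv(x, kernel):
    """Convolve along the time axis only, with zero padding so T is preserved.

    x: (B, T, C, H, W) float array; kernel: 1D array of odd length k.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (0, 0), (0, 0), (0, 0)))
    t = x.shape[1]
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        out += kernel[i] * padded[:, i:i + t]
    return out

def spatial_block(frames):
    """Hypothetical placeholder for a per-frame 2D network block."""
    return frames * 2.0

# Example: batch of 2 sequences, 3 frames each, 4 latent channels, 2x2 spatial.
x = np.arange(2 * 3 * 4 * 2 * 2, dtype=float).reshape(2, 3, 4, 2, 2)
frames = fold_time(x)                                 # shape (6, 4, 2, 2)
feats = unfold_time(spatial_block(frames), batch=2)   # shape (2, 3, 4, 2, 2)
control = temporal_conv(feats, np.zeros(3))           # zero-initialized kernel
```

With a zero-initialized kernel the temporal layer outputs all zeros, so at the start of training the added branch contributes nothing to the pretrained model's features. This is the usual motivation for zero initialization in ControlNet-style modules.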