GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis

Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs

TL;DR

This work introduces GeoDiT, a diffusion transformer for text-to-satellite image generation with point-based control, together with an adaptive local attention mechanism that regularizes the attention scores based on the input point queries.

Abstract

We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.

Paper Structure (26 sections, 4 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: Existing methods for satellite image synthesis rely on dense spatial layout controls that are expensive and time-consuming to acquire, yet semantically limited. Our proposed method enables conditioning the generation process using semantically rich point queries, where each point is associated with a free-form text prompt. The spatial layout of the generation is guided by the point locations, while the semantics are driven by the accompanying text. Point queries do not impose strict pose and shape constraints on the generation process, resulting in a wide variety of semantically consistent satellite images (see Figure \ref{point_gen}).
  • Figure 2: Performance vs Latency. We compare the performance of several remote sensing generative models against the time taken to generate a single image. Both variants of our proposed model, GeoDiT, are efficient and outperform the state-of-the-art generative models.
  • Figure 3: Proposed architecture of our GeoDiT-XL/2-$\alpha$ (left) and GeoDiT-XL/2-$\Sigma$ (right) models.
  • Figure 4: Learned spatial prior for various concepts. Note that the model has learned concept-specific spatial priors; for example, it predicts a smaller spatial extent for "small building".
  • Figure 5: Samples from GeoDiT-XL/2-$\mathbf{\Sigma}$. Controlling the generation process with point queries without a global text prompt enables flexible and diverse satellite image generation without strict pose and shape constraints. In particular, as shown in the last row, our model generates a single consistent canal from just two input points.
  • ...and 12 more figures
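
The abstract describes an adaptive local attention mechanism that regularizes attention scores based on the input point queries, and Figure 4 suggests that each concept is associated with a spatial extent. The listing here does not give the formulation, so the following is only a minimal illustrative sketch of one way such a regularizer could look, not the authors' method: a Gaussian distance penalty subtracted from the attention logits before the softmax. The function names, the Gaussian form, and the `sigma` parameter are all assumptions made for illustration.

```python
import math


def grid_centers(n):
    """Centers of an n x n patch grid in normalized [0, 1] image coordinates."""
    return [((col + 0.5) / n, (row + 0.5) / n)
            for row in range(n) for col in range(n)]


def local_attention_weights(logits, patch_centers, point, sigma):
    """Softmax over patch logits after a distance-based penalty.

    Hypothetical regularizer: patches far from the query point are
    down-weighted by a Gaussian penalty whose width is `sigma`
    (standing in for a per-concept spatial extent).
    """
    px, py = point
    biased = []
    for score, (cx, cy) in zip(logits, patch_centers):
        dist_sq = (cx - px) ** 2 + (cy - py) ** 2
        biased.append(score - dist_sq / (2.0 * sigma ** 2))
    # Numerically stable softmax.
    peak = max(biased)
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


centers = grid_centers(4)
uniform_logits = [0.0] * len(centers)  # no content preference, only locality
query_point = (0.125, 0.125)           # center of the top-left patch

narrow = local_attention_weights(uniform_logits, centers, query_point, sigma=0.1)
wide = local_attention_weights(uniform_logits, centers, query_point, sigma=0.5)
```

With a small `sigma`, almost all of the attention mass stays on the patch containing the point; a larger `sigma` spreads it over neighboring patches, which is the kind of behavior a "small building" versus a larger concept would call for under this assumed formulation.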