Generating Synthetic Satellite Imagery for Rare Objects: An Empirical Comparison of Models and Metrics

Tuong Vy Nguyen, Johannes Hoster, Alexander Glaser, Kristian Hildebrand, Felix Biessmann

TL;DR

This study tackles the challenge of generating realistic satellite imagery of rare objects. Diffusion-based models are fine-tuned (e.g., with DreamBooth and Textual Inversion) and combined with layout conditioning through a T2I-Adapter, so that images can be generated from text prompts alone or from text prompts plus Unity-derived layouts. A large-scale human evaluation (over 30,000 ratings) of six model configurations is compared against automatic metrics, revealing a consistent misalignment between common metrics (the Inception Score, IS, and its variants) and human perception, while CLIPScore tracks text alignment better. The findings show that synthetic imagery of rare classes is feasible and that layout conditioning can improve structural control, but that automated metrics remain unreliable proxies for visual fidelity, especially in remote-sensing domains. These results underscore the need for human-centered evaluation and careful interpretation of generative-AI metrics, with implications for misinformation risk, transparency, and policy around synthetic satellite imagery.

Abstract

Generative deep learning architectures can produce realistic, high-resolution fake imagery, with potentially drastic societal implications. A key question in this context is: how easy is it to generate realistic imagery, in particular for niche domains? The iterative process required to achieve specific image content is difficult to automate and control. Especially for rare classes, it remains difficult to assess fidelity, meaning whether generative approaches produce realistic imagery, and alignment, meaning how well the generation can be guided by human input. In this work, we present a large-scale empirical evaluation of generative architectures that we fine-tuned to generate synthetic satellite imagery. We focus on nuclear power plants as an example of a rare object category: since there are only around 400 facilities worldwide, this restriction is exemplary for many other scenarios in which training and test data are limited by the small number of real-world examples. We generate synthetic imagery by conditioning on two kinds of modalities: textual input and image input obtained from a game engine that allows for detailed specification of the building layout. The generated images are assessed with commonly used metrics for automatic evaluation, and these scores are then compared with human judgement from our user studies to assess the trustworthiness of the metrics. Our results demonstrate that even for rare objects, generating authentic synthetic satellite imagery from text or detailed building layouts is feasible. In line with previous work, we find that automated metrics are often not aligned with human perception; in fact, we find strong negative correlations between commonly used image quality metrics and human ratings.

Paper Structure (24 sections, 2 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Samples of synthetic satellite imagery of facilities. Generated using fine-tuned generative models with only text input (top row) and the same fine-tuned models with additional image input (bottom row).
  • Figure 2: Workflow of our image generation process. We process the renders from the game engine into the respective input modalities (depth/sketch/canny). If used, these are fed into the T2I-Adapter and serve as structural guidance in the denoising component, together with the text embeddings obtained from the CLIP text encoder. A text prompt could be "an aerial view of a [*] nuclear power plant". All components are pre-trained; in the case of fine-tuning, the U-Net (with DreamBooth) or the text encoder (with Textual Inversion) is modified.
  • Figure 3: User interfaces of the conducted user studies. Shown are the study designs of the (a) image fidelity, (b) text alignment and (c) layout alignment assessments.
  • Figure 4: Human ratings of images generated with different models show robust rankings for (a) image fidelity and (b) text alignment. Results for (c) layout alignment are less pronounced, although slight differences are still visible. Humans consistently rate real images as most realistic, but there are substantial differences between models w.r.t. fidelity and alignment scores. All ratings were normalized per user and then scaled back to the range 1-5.
  • Figure 5: Quality assessment of images generated with different models. Results are shown per metric for each model for the experiments on (a) image fidelity, (b) text alignment and (c) layout alignment. A comparison of human ratings and offline metrics reveals a disparity between the two: the image quality of different models is judged differently, and images that receive a high rating from humans can receive a low score from offline metrics.
  • ...and 1 more figures
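
The per-user normalization mentioned in the Figure 4 caption and the comparison of human ratings with offline metrics in Figure 5 can be illustrated with a small Python sketch. The captions do not state the exact procedure, so everything here is an assumption: ratings are z-scored per user, min-max rescaled to the range 1-5, averaged per model, and then compared with a metric via Spearman rank correlation. The function names (`normalize_per_user`, `mean_per_model`, `ranks`, `spearman`) and all numbers are hypothetical toy values, not data or results from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_user(ratings):
    """Z-score each user's ratings, then rescale all values to [1, 5].

    ratings: list of (user, model, score) tuples.
    This is an assumed procedure; the source only says ratings were
    normalized per user and scaled back to the range 1-5.
    """
    by_user = defaultdict(list)
    for user, _, score in ratings:
        by_user[user].append(score)

    z_scored = []
    for user, model, score in ratings:
        mu = mean(by_user[user])
        sigma = pstdev(by_user[user])
        # A user who gives the same score everywhere carries no ranking signal.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        z_scored.append((user, model, z))

    lo = min(z for _, _, z in z_scored)
    hi = max(z for _, _, z in z_scored)
    span = hi - lo
    rescaled = []
    for user, model, z in z_scored:
        value = 1.0 + 4.0 * (z - lo) / span if span > 0 else 3.0
        rescaled.append((user, model, value))
    return rescaled


def mean_per_model(rows):
    """Average the (user, model, value) rows into one value per model."""
    by_model = defaultdict(list)
    for _, model, value in rows:
        by_model[model].append(value)
    return {model: mean(values) for model, values in by_model.items()}


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy example: two users rate three hypothetical models A, B, C.
toy_ratings = [
    ("u1", "A", 5), ("u1", "B", 3), ("u1", "C", 1),
    ("u2", "A", 4), ("u2", "B", 3), ("u2", "C", 2),
]
human = mean_per_model(normalize_per_user(toy_ratings))

# Hypothetical metric scores that rank the models in the opposite order.
toy_metric = {"A": 1.2, "B": 2.5, "C": 3.1}

models = sorted(human)
rho = spearman([human[m] for m in models], [toy_metric[m] for m in models])
print(rho)  # -1.0 for this toy data
```

Z-scoring before rescaling removes differences in how individual users spread their ratings over the scale, which is one plausible reading of "normalized for every user". A rank correlation is used here because it only asks whether humans and a metric order the models the same way, which is the kind of disagreement the Figure 5 caption describes.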