AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, Deyu Meng

TL;DR

Remote sensing image object detection (RSIOD) suffers from limited labeled data, which hampers detector performance. The authors introduce AeroGen, a layout-controllable diffusion framework that accepts both horizontal and rotated bounding boxes as conditions for generating high-quality remote sensing images. It is paired with an end-to-end data augmentation pipeline in which a diversity-conditioned generator and filtering steps maintain diversity as well as semantic and layout consistency. Key contributions include a layout-conditional diffusion model with Fourier-encoded layout inputs and layout mask attention, plus a five-stage generative pipeline (label generation, label filtering, image generation, image filtering, and augmentation) that produces synthetic data used alongside real data to train detectors. On DIOR, DIOR-R, and HRSC, mAP improves by 3.7%, 4.3%, and 2.43%, respectively, with notable gains in rare object classes, which supports the use of layout-controlled diffusion for expanding RSIOD datasets.
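The five stages named above can be sketched as a chain of plain functions. This is only an illustrative skeleton under stated assumptions: every name (`Sample`, `generate_labels`, `filter_labels`, `generate_images`, `filter_images`, `augment`, `run_pipeline`), the `"ship"` class, and both thresholds are hypothetical, and random numbers stand in for the learned label generator, the diffusion model, and the consistency filters. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    boxes: list            # [(class_name, cx, cy, w, h, angle_deg)], normalized to [0, 1]
    image: object = None   # placeholder for generated pixels
    quality: float = 0.0   # placeholder consistency score in [0, 1]


def generate_labels(n, rng):
    """Stage 1 (stand-in): sample random layouts instead of using a learned generator."""
    samples = []
    for _ in range(n):
        boxes = [
            ("ship", rng.random(), rng.random(),
             rng.uniform(0.02, 0.8), rng.uniform(0.02, 0.8), rng.uniform(0.0, 180.0))
            for _ in range(rng.randint(1, 4))
        ]
        samples.append(Sample(boxes))
    return samples


def filter_labels(samples, max_side=0.5):
    """Stage 2 (stand-in): drop layouts containing implausibly large boxes."""
    return [s for s in samples
            if all(b[3] <= max_side and b[4] <= max_side for b in s.boxes)]


def generate_images(samples, rng):
    """Stage 3 (stand-in): a real system would call the layout-conditioned model here."""
    for s in samples:
        s.image = "synthetic-image-placeholder"
        s.quality = rng.random()  # hypothetical score a later filter would compute
    return samples


def filter_images(samples, min_quality=0.5):
    """Stage 4 (stand-in): keep only images whose score passes the threshold."""
    return [s for s in samples if s.quality >= min_quality]


def augment(real, synthetic):
    """Stage 5: train on real and surviving synthetic samples together."""
    return list(real) + list(synthetic)


def run_pipeline(real, n_candidates=100, seed=0):
    rng = random.Random(seed)
    labels = filter_labels(generate_labels(n_candidates, rng))
    images = filter_images(generate_images(labels, rng))
    return augment(real, images)
```

The point of the structure is that each filter sits directly after the generator it checks, so low-quality layouts are discarded before any image is synthesized for them.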

Abstract

Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse on rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e., AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box conditional generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models: the mAP on the DIOR, DIOR-R, and HRSC datasets improves by 3.7%, 4.3%, and 2.43%, respectively. The code is available at https://github.com/Sonettoo/AeroGen.

Paper Structure

This paper contains 14 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Generated images with our proposed AeroGen. AeroGen enables the input of both horizontal and rotated bounding box layout conditions, facilitating accurate remote sensing image layout generation.
  • Figure 2: AeroGen's overall architecture. (a) The layout embedding module combines bounding box coordinates with vectorized semantic information using Fourier and MLP layers. This encodes layout information to facilitate control, with the prompt description processed by a CLIP text encoder for global conditional guidance. (b) The injection of layout information at the noise level is demonstrated, where a local mask governs the injection position of the layout information, allowing for finer layout control. (c) The overall architecture and training process of AeroGen. At each timestep, the image being denoised first passes through a layout information injection module, which enhances layout conditional guidance.
  • Figure 3: Overview of the pipeline based on AeroGen. By fitting the conditional distribution using a diffusion model, we expand a diverse set of layout conditions and combine them with AeroGen to generate synthetic data. Additionally, we introduce two filters to eliminate low-quality synthetic conditions and images, further ensuring the semantic consistency and layout consistency of the synthetic images. Finally, we incorporate synthetic images alongside real images in the training set to improve the performance of downstream tasks.
  • Figure 4: Visual comparison of images generated by different methods on the DIOR dataset.
  • Figure 5: Comparison of per-category mAP50 on the DIOR-R dataset with data augmentation using 50k generated images (aug) and without augmentation (w/o aug).
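To make the layout conditioning described for Figure 2 (a) and (b) more concrete, the following sketch shows one possible way to turn a box into a Fourier feature vector and a local binary mask. Several things here are assumptions rather than details from the paper: representing every box (horizontal or rotated) by its four corner points, the power-of-two frequency schedule, the `num_freqs` default, and all function names. The MLP and attention layers that would consume these inputs are omitted.

```python
import math


def fourier_encode(coords, num_freqs=4):
    """Map each scalar x to sin/cos pairs at frequencies 2^k * pi, k = 0..num_freqs-1."""
    feats = []
    for x in coords:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * x
            feats.append(math.sin(angle))
            feats.append(math.cos(angle))
    return feats


def rotated_box_corners(cx, cy, w, h, angle_deg):
    """Return the four corners of a box rotated about its center (normalized coords)."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners


def box_mask(corners, size):
    """Rasterize a convex quadrilateral into a size x size binary mask (1 = inside)."""
    def inside(px, py):
        sign = 0
        for i in range(4):
            x1, y1 = corners[i]
            x2, y2 = corners[(i + 1) % 4]
            # The cross product tells which side of the edge the point lies on.
            cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
            if cross != 0:
                if sign == 0:
                    sign = 1 if cross > 0 else -1
                elif (cross > 0) != (sign > 0):
                    return False
        return True

    # Sample each cell at its center, with coordinates normalized to [0, 1].
    return [[1 if inside((j + 0.5) / size, (i + 0.5) / size) else 0
             for j in range(size)]
            for i in range(size)]


# A horizontal box is simply the angle_deg = 0 case of the same representation.
corners = rotated_box_corners(0.5, 0.5, 0.4, 0.2, 30.0)
embedding = fourier_encode([v for point in corners for v in point])
mask = box_mask(corners, size=8)
```

Using corner points for both box types is what lets a single encoder handle horizontal and rotated layouts; the mask then restricts where that box's embedding would be injected.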