Object Fidelity Diffusion for Remote Sensing Image Generation

Ziqi Ye, Shuran Ma, Jie Yang, Xiaoyi Yang, Yi Yang, Ziyang Gong, Xue Yang, Haipeng Wang

TL;DR

This work tackles the challenge of high-fidelity, controllable remote sensing image generation by introducing Object Fidelity Diffusion (OF-Diff), which extracts object shape priors from layouts and employs online-distillation to align diffusion outputs without real-image references. It augments diffusion with a dual-decoder architecture and a Shape Generation Module to enforce morphology-consistent objects, and adds DDPO to fine-tune the process for diversity and semantic consistency. Empirical results show OF-Diff outperforms state-of-the-art layout-to-image methods across fidelity, layout accuracy, and downstream detection metrics, with notable gains on small and polymorphic objects. The approach improves practical RS data augmentation for object detection while highlighting trade-offs between aesthetics and distribution fidelity, and it identifies mask quality as a key dependency for shaping results.

Abstract

High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images because they fail to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff). Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.
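The abstract does not give the form of the diffusion consistency loss, so the following is only an illustrative sketch of how a dual-branch objective of this kind could be assembled. It assumes each branch predicts the noise added to the image, that each branch has its own denoising term, and that a weighted mean-squared term pulls the shape-only branch toward the image-conditioned branch. The names `mse`, `dual_branch_loss`, and `weight` are hypothetical and are not taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dual_branch_loss(eps_true, eps_shape, eps_image, weight=1.0):
    """Hypothetical two-branch objective (an assumption, not the paper's formula).

    eps_true  : noise actually added to the image at this timestep
    eps_shape : noise predicted by the branch that sees only layout/shape priors
    eps_image : noise predicted by the branch that also sees real-image features
    weight    : strength of the consistency term between the two branches
    """
    denoise_shape = mse(eps_shape, eps_true)   # standard denoising term, shape branch
    denoise_image = mse(eps_image, eps_true)   # standard denoising term, image branch
    consistency = mse(eps_shape, eps_image)    # pulls the two predictions together
    return denoise_shape + denoise_image + weight * consistency


# Toy example with flattened four-element "noise maps".
eps_true = [0.0, 1.0, -1.0, 0.5]
eps_shape = [0.1, 0.9, -1.0, 0.5]
eps_image = [0.0, 1.0, -0.9, 0.5]
loss = dual_branch_loss(eps_true, eps_shape, eps_image, weight=0.5)
```

Under this assumption, only the shape branch would be needed at sampling time, which matches the abstract's statement that no real images are provided during sampling.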

Paper Structure

This paper contains 26 sections, 14 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Four critical failure modes in the State-of-the-Art (SOTA) method (CC-Diff): a distributional drift from real data, visualized by t-SNE; and (a) Control Leakage; (b) Structural Distortion; (c) Dense Generation Collapse. Our OF-Diff (2nd row) effectively resolves these issues.
  • Figure 2: Comparison of OF-Diff with mainstream L2I methods. FG/BG stands for foreground/background. (a) Layout-conditioned baseline. (b) Added instance-based module, limited by quality/quantity of patches from ground truth. (c) OF-Diff enhances fidelity via shape extraction and DDPO, without patch reliance. (d) Results demonstrate superiority.
  • Figure 3: OF-Diff's overall architecture. (a) During training, object shape features extracted by ESGM and image features are processed by ControlNet, and the resulting information is used to update stable diffusion decoders via online-distillation. (b) During sampling, only the label and the shape feature stable diffusion decoder are used to generate synthetic images. (c) Architecture of the Enhanced Shape Generation Module (ESGM).
  • Figure 4: Qualitative results on DIOR, DOTA and HRSC2016. OF-Diff produces more realistic, higher-fidelity images than other methods.
  • Figure 5: AP$_{50}$ on DIOR and DOTA.
  • ...and 7 more figures