Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Wenhao Li, Zimeng Wu, Yu Wu, Zehua Fu, Jiaxin Chen

Abstract

Unmanned aerial vehicle (UAV) based object detection is a critical yet challenging task when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proven effective in improving detection accuracy by synthesizing labeled images with diffusion models. However, they frequently produce artifacts, especially near the layout boundaries of tiny objects, which substantially limits their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions during synthesis, combined with a label refinement step that corrects missing, extra, and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches and consistently improves accuracy when integrated with distinct detectors. The source code is available at https://github.com/Sirius-Li/UAVGen.
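The abstract's two components can be illustrated with a minimal toy sketch: pasting class-prototype patches into bounding boxes to form the layout condition (the idea behind VPC-DM's visual prototypes), and filtering labels whose regions the generator left empty (a stand-in for the label refinement in FRE-DP). All function names, the box format `(x1, y1, x2, y2, cls)`, and the coverage heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_layout_condition(canvas_hw, boxes, prototypes):
    """Paste a representative prototype patch for each box's class onto a
    blank canvas, yielding a layout image to condition the generator on.
    `prototypes` maps class id -> HxWx3 uint8 patch (assumed format)."""
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    for (x1, y1, x2, y2, cls) in boxes:
        patch = prototypes[cls]
        h, w = max(1, y2 - y1), max(1, x2 - x1)
        # nearest-neighbour resize of the prototype patch to the box size
        ys = np.linspace(0, patch.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, patch.shape[1] - 1, w).astype(int)
        canvas[y1:y1 + h, x1:x1 + w] = patch[ys][:, xs]
    return canvas

def refine_labels(boxes, generated_mask, min_coverage=0.5):
    """Drop boxes whose region the generator left mostly empty -- a crude
    proxy for correcting 'missing' generations; the paper's refinement
    also handles extra and misaligned objects."""
    kept = []
    for (x1, y1, x2, y2, cls) in boxes:
        region = generated_mask[y1:y2, x1:x2]
        if region.size and region.mean() >= min_coverage:
            kept.append((x1, y1, x2, y2, cls))
    return kept
```

In practice the layout condition would feed a conditional diffusion model rather than be used directly; the sketch only makes the data-flow of "prototype -> layout -> generate -> refine labels" concrete.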

Paper Structure

This paper contains 26 sections, 16 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of different layout-to-image data generation methods. (a) In general data generation, the layout information is directly derived from real data, and the synthesized data is directly used as training data for the detector. (b) Our UAVGen enhances this paradigm by introducing multi-cue, high-quality layout conditions for focal regions, as well as refining the synthesized data.
  • Figure 2: Architecture of Visual Prototype Conditioned Focal Region Generation. (a) The Visual Prototype Conditioned Diffusion Model (VPC-DM) generates images guided by layout images produced from selected visual prototypes. (b) The Focal Region Enhanced Data Pipeline (FRE-DP) synthesizes images over object-centric areas to overcome the limitations of small-object generation. Moreover, Label Refinement mitigates the misalignment between layouts and generated images.
  • Figure 3: Comparison of mAP across different categories on VisDrone. Our method (w/ syn.) adds 738 synthesized images and yields consistent performance improvements across all categories compared to the non-augmented baseline (w/o syn.).
  • Figure 4: Comparison of generated images on VisDrone. Our method exhibits superior layout-image consistency and enhanced visual fidelity for generated small-scale objects. The yellow dashed boxes denote regions where the generated targets are inconsistent with the inputs, as well as blurry, low-quality targets. For the generation of small pedestrian targets (line 1), our method achieves significantly higher clarity than the compared methods. Furthermore, in scenarios with dense small targets (lines 2, 3), our method exhibits superior consistency between objects and layout.
  • Figure 5: Impact of varying the number of generated images on the object detection model. Even with fewer generated images, our method achieves superior performance.