ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

Jingyuan Zhu, Shiyu Li, Yuxuan Liu, Ping Huang, Jiulong Shan, Huimin Ma, Jian Yuan

TL;DR

This paper presents ODGEN, a novel method that generates high-quality images conditioned on bounding boxes, facilitating data synthesis for object detection, and that is robust in complex scenes and specific domains.

Abstract

Modern diffusion-based image generative models have made significant progress and show promise for enriching training data for the object detection task. However, generation quality and controllability remain limited for complex scenes containing multi-class objects and dense objects with occlusions. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. We then propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline and evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves mAP@.50:.95 by up to 25.3% with object detectors such as YOLOv5 and YOLOv7, outperforming prior controllable generative methods. In addition, we design an evaluation protocol based on COCO-2014 to validate ODGEN in general domains and observe an advantage of up to 5.6% in mAP@.50:.95 over existing methods.

Paper Structure

This paper contains 25 sections, 2 equations, 19 figures, 24 tables, and 1 algorithm.

Figures (19)

  • Figure 1: The proposed ODGEN enables controllable image generation from bounding boxes and text prompts. It can generate high-quality data for complex scenes, encompassing multiple categories, dense objects, and occlusions, which can be used to enrich the training data for object detection.
  • Figure 2: ODGEN training pipeline: (a) A pre-trained diffusion model is fine-tuned on a detection dataset with both entire images and cropped foreground patches. (b) A text list is built based on class labels. The fine-tuned diffusion model in stage (a) is used to generate a synthetic object image for each text. Generated object images are resized and pasted on empty canvases per box positions, constituting an image list. (c) The image list is concatenated in the channel dimension and encoded as conditions for ControlNet. The text list is encoded by the CLIP text encoder, stacked, and encoded again by the text embedding encoder as inputs for ControlNet.
  • Figure 3: Pipeline for object detection dataset synthesis. Yellow block: estimate Gaussian distributions for the bounding box number, area, aspect ratio, and location based on the training set. Blue block: sample pseudo labels from the Gaussian distributions and generate conditions, including text and image lists, to synthesize novel images. Pink block: train a classifier with foreground and background patches randomly cropped from the training set and use it to filter out pseudo labels whose objects failed to be synthesized. Finally, the filtered labels and synthetic images compose the datasets.
  • Figure 4: Comparison between ODGEN and other methods under the same condition shown in the first column. ODGEN can be generalized to specific domains and enables accurate layout control.
  • Figure 5: Visual comparison of results for models trained on COCO. ODGEN is better suited to synthesizing complex scenes with multiple categories of objects and bounding box occlusions.
  • ...and 14 more figures
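
The label-sampling step described in the Figure 3 caption (yellow and blue blocks) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption only states that Gaussian distributions are estimated for the bounding box number, area, aspect ratio, and location. The box format `(cx, cy, w, h)` normalized to [0, 1], the treatment of each statistic as an independent 1-D Gaussian, the clamping ranges, and the helper names `fit_gaussians`, `clamp`, and `sample_pseudo_label` are all assumptions made for this example. Class sampling, image generation, and the classifier-based filtering (pink block) are omitted.

```python
import random
import statistics


def fit_gaussians(boxes_per_image):
    """Fit an independent 1-D Gaussian (mean, std) to each layout statistic.

    boxes_per_image: one list per training image of (cx, cy, w, h) boxes,
    normalized to [0, 1]. This input format is an assumption of the sketch.
    """
    boxes = [box for image in boxes_per_image for box in image]
    stats = {
        "count": [len(image) for image in boxes_per_image],
        "area": [w * h for _, _, w, h in boxes],
        "aspect": [w / h for _, _, w, h in boxes],
        "cx": [cx for cx, _, _, _ in boxes],
        "cy": [cy for _, cy, _, _ in boxes],
    }
    return {k: (statistics.mean(v), statistics.pstdev(v)) for k, v in stats.items()}


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def sample_pseudo_label(params, rng):
    """Draw one pseudo layout: a list of (cx, cy, w, h) boxes inside the image."""
    n = max(1, round(rng.gauss(*params["count"])))  # at least one box
    boxes = []
    for _ in range(n):
        # Clamping ranges are illustrative choices, not values from the paper.
        area = clamp(rng.gauss(*params["area"]), 1e-4, 1.0)
        aspect = clamp(rng.gauss(*params["aspect"]), 0.1, 10.0)
        # area = w * h and aspect = w / h  =>  w = sqrt(area * aspect)
        w = min(1.0, (area * aspect) ** 0.5)
        h = min(1.0, (area / aspect) ** 0.5)
        # Keep the whole box inside the unit square.
        cx = clamp(rng.gauss(*params["cx"]), w / 2, 1 - w / 2)
        cy = clamp(rng.gauss(*params["cy"]), h / 2, 1 - h / 2)
        boxes.append((cx, cy, w, h))
    return boxes


# Toy training annotations (hypothetical values, for illustration only).
train = [
    [(0.3, 0.4, 0.2, 0.1), (0.7, 0.6, 0.3, 0.3)],
    [(0.5, 0.5, 0.4, 0.2)],
    [(0.2, 0.8, 0.1, 0.2), (0.6, 0.3, 0.2, 0.2), (0.8, 0.7, 0.15, 0.25)],
]
params = fit_gaussians(train)
pseudo = sample_pseudo_label(params, random.Random(0))
```

Each sampled layout would then be turned into the text and image lists that condition the generator. Passing a seeded `random.Random` keeps the sampled layouts reproducible across runs.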