InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, Lin Ma

TL;DR

InstaGen tackles data bottlenecks in object detection by synthesizing a labeled dataset from a diffusion model augmented with an instance grounding head. The approach fine-tunes Stable Diffusion on detection data to produce multi-object, context-rich images, and jointly learns bounding-box localization via an open-vocabulary grounding module trained with base categories and self-trained on novel categories. Detectors trained on the combined real and synthetic data achieve strong gains in open-vocabulary and data-sparse regimes, and show competitive cross-dataset transfer. The work demonstrates that diffusion-based data synthesis, coupled with grounding and self-training, can provide substantial practical benefits for scalable, annotation-free detector training.

Abstract

In this paper, we present a novel paradigm to enhance the ability of an object detector, e.g., expanding categories or improving detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained generative diffusion model to augment it with the ability to localise instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual features of the diffusion model, using supervision from an off-the-shelf object detector and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer that enhances object detectors trained on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code: https://fcjian.github.io/InstaGen.
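The alignment idea in the abstract can be illustrated with a toy sketch. This is not the authors' grounding head: the cosine-similarity scoring, the 0.5 confidence threshold, the function names, and the three-dimensional vectors are all assumptions chosen for illustration. It only shows the general pattern of matching regional features against category text embeddings and keeping confident matches, as a self-training step might do when producing pseudo-labels.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ground_regions(region_feats, category_embeds, threshold=0.5):
    """Label each region with its best-matching category.

    Regions whose best score falls below `threshold` (a hypothetical
    confidence cut-off) are dropped rather than labelled.
    """
    labels = []
    for idx, feat in enumerate(region_feats):
        scores = {name: cosine(feat, emb) for name, emb in category_embeds.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            labels.append((idx, best, scores[best]))
    return labels

# Toy data: hypothetical text embeddings and regional visual features.
category_embeds = {"dog": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0]}
region_feats = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

result = ground_regions(region_feats, category_embeds)
for idx, name, score in result:
    print(f"region {idx}: {name} ({score:.2f})")
```

In this toy input the third region matches neither category and is discarded, which mirrors how low-confidence predictions would be filtered out before being used as pseudo-labels.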

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: (a) Synthetic images generated by Stable Diffusion and by our proposed InstaGen; the latter can serve as a dataset synthesizer for sourcing photo-realistic images and instance bounding boxes at scale. (b) On open-vocabulary detection, training on synthetic images yields a significant improvement over CLIP-based methods on novel categories. (c) Training on the synthetic images generated by InstaGen also enhances detection performance in the closed-set scenario, particularly when data is sparse.
  • Figure 2: Illustration of the process for fine-tuning the diffusion model and training the grounding head: (a) The Stable Diffusion model is fine-tuned on the base categories of the detection dataset. (b) The grounding head is trained on synthetic images, with supervised learning on base categories and self-training on novel categories.
  • Figure 3: Illustration of the dataset generation process in InstaGen, which consists of two steps: (i) Image collection: given a text prompt, the Stable Diffusion model (SDM) generates images containing the objects described in the prompt; (ii) Annotation generation: the instance-level grounding head aligns the category embedding with the regional visual features of the SDM, generating the corresponding object bounding boxes.
  • Figure 4: Visualization of the synthetic images and bounding boxes generated by different models. Green bounding boxes denote objects from base categories, while red bounding boxes denote objects from novel categories.
  • Figure S1: Qualitative results generated by InstaGen. Green bounding boxes denote objects from base categories, while red bounding boxes denote objects from novel categories.
  • ...and 1 more figure