DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection

Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Neel Joshi, Laurent Itti, Vibhav Vineet

TL;DR

This work tackles the scarcity of labeled data for object detection by introducing a language-guided, two-stage data generation pipeline that decouples foreground mask production from background context synthesis. Foreground images are generated from text prompts and their masks are extracted via unsupervised segmentation, while diverse context images are generated with text-to-image models from language descriptions of a small set of user-provided context images (CDIs). The two are then combined by compositional pasting to create labeled training data. The approach uses image captioning (SCST) and CLIP filtering to ensure contextual relevance and avoid interest-class leakage, achieving substantial improvements on VOC, COCO, and various instance datasets, particularly in zero-shot and low-resource settings. The method demonstrates strong compositional properties, privacy-preserving data generation, and applicability to out-of-distribution scenarios, suggesting broad potential for scalable, language-enabled dataset creation in vision tasks.

Abstract

We propose a new paradigm to automatically generate training data with accurate labels at scale using text-to-image synthesis frameworks (e.g., DALL-E and Stable Diffusion). The proposed approach decouples training data generation into foreground object mask generation and background (context) image generation. For foreground object mask generation, we use a simple textual template with the object class name as input to DALL-E to generate a diverse set of foreground images. A foreground-background segmentation algorithm is then used to generate foreground object masks. Next, to generate context images, a language description of the context is first generated by applying an image captioning method to a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images using the DALL-E framework. The context images are then composited with the object masks generated in the first step to provide an augmented training set for a classifier. We demonstrate the advantages of our approach on four object detection datasets, including the Pascal VOC and COCO object detection tasks. Furthermore, we highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios.
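The templating step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the template strings, the function names `foreground_prompts` and `context_prompts`, and the example class and context words are all assumptions, since the exact prompt wording is not given here. The resulting sentences would then be passed to a text-to-image model, which is outside this sketch.

```python
from itertools import product

# Hypothetical templates: the wording used in the paper is not reproduced here.
FOREGROUND_TEMPLATES = [
    "a photo of a {name} on a plain white background",
    "a realistic image of a {name}, isolated",
]
CONTEXT_TEMPLATES = [
    "a photo of a {context}",
    "an empty scene of a {context}",
]


def foreground_prompts(class_names):
    """Fill every class name into every foreground template.

    Returns (class_name, prompt) pairs so the label stays attached
    to each sentence that will later be sent to the image generator.
    """
    return [
        (name, template.format(name=name))
        for name, template in product(class_names, FOREGROUND_TEMPLATES)
    ]


def context_prompts(context_words):
    """Fill every context word or phrase into every context template."""
    return [
        template.format(context=context)
        for context, template in product(context_words, CONTEXT_TEMPLATES)
    ]


if __name__ == "__main__":
    for label, prompt in foreground_prompts(["dog", "cat"]):
        print(label, "->", prompt)
    for prompt in context_prompts(["grass field", "forest"]):
        print(prompt)
```

Keeping the class name paired with each foreground prompt is what lets the label come for free: every image generated from that prompt inherits the class without manual annotation.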

Paper Structure

This paper contains 25 sections, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Comparison of the DALL-E for detection pipeline and the traditional human-centric pipeline.
  • Figure 2: Our pipeline consists of foreground generation and background context generation. (a) Foreground generation (top row): (1) We fill the interest class name (e.g., dog) into fixed prompt templates to produce foreground sentences. (2) We feed the sentences to DALL-E (or Stable Diffusion) to generate high-quality foreground images with easy-to-separate backgrounds. (3) We use off-the-shelf image segmentation methods to extract foreground segments from the foreground images. (b) Background context generation (bottom row): (4) We use an image captioning method (e.g., SCST, Rennie et al., 2017) to generate captions for the user-provided CDIs (the user can provide as few as one image). (5) We leverage lexical networks and models to extract the background context words (e.g., grass field) and augment them with related contexts based on ConceptNet (e.g., forest). (6) We create context description sentences from the context words using templates. (7) We feed the sentences to DALL-E (or Stable Diffusion) to generate high-quality background images. (8) We use CLIP (Radford et al., 2021) to filter the generated images and further ensure that they contain no interest class. (9) We combine the foreground segments and background images using cut and paste to obtain synthetic images with corresponding annotations. (10) We use the synthetic dataset to train object detection/segmentation models.
  • Figure 3: We highlight the compositional and explainable properties of our method. Specifically, when the provided CDI cannot perfectly describe the real test scenario, the compositional property of language helps correct the context description by removing content, adding content, or changing style. For instance, if the initial description contains the noisy information "man and a woman", we can directly intervene and remove it to generate a congruent context description. Note that the test scenarios in all four examples are from the GMU Kitchen dataset. Images with red frames show the images generated without language intervention, and images with green frames show the images generated after intervention.
  • Figure 4: Training images generated by our pipeline: foreground objects pasted on DALL-E synthesized images, with CDIs from the Pascal VOC dataset.
  • Figure 5: Context images generated by our pipeline. The generated images are coherent with the input captions.
  • ...and 8 more figures
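
The cut-and-paste step in Figure 2 (step 9) can be illustrated with a small sketch. It is not the authors' implementation. It assumes images are plain 2D lists of pixel values, that the mask is a 2D list of 0/1 values matching the foreground, and that the annotation is an inclusive pixel bounding box. The function name `paste_foreground` and the toy data are hypothetical.

```python
def paste_foreground(background, foreground, mask, top, left):
    """Paste masked foreground pixels onto a copy of the background.

    background, foreground: 2D lists (rows of pixel values).
    mask: 2D list of 0/1 with the same shape as foreground.
    top, left: where the foreground's upper-left corner is placed.

    Returns (composite, bbox), where bbox is (x_min, y_min, x_max, y_max)
    in inclusive pixel coordinates, or None if no pixel was pasted.
    """
    composite = [row[:] for row in background]  # leave the input untouched
    height, width = len(background), len(background[0])
    xs, ys = [], []
    for r, mask_row in enumerate(mask):
        for c, keep in enumerate(mask_row):
            y, x = top + r, left + c
            # Copy only masked pixels that fall inside the background.
            if keep and 0 <= y < height and 0 <= x < width:
                composite[y][x] = foreground[r][c]
                xs.append(x)
                ys.append(y)
    bbox = (min(xs), min(ys), max(xs), max(ys)) if xs else None
    return composite, bbox


if __name__ == "__main__":
    background = [[0] * 5 for _ in range(4)]  # 4x5 "context image"
    foreground = [[9, 9], [9, 9]]             # 2x2 "object image"
    mask = [[0, 1], [1, 1]]                   # object occupies 3 pixels
    composite, bbox = paste_foreground(background, foreground, mask, top=1, left=2)
    for row in composite:
        print(row)
    print("bbox:", bbox)
```

Because the box is computed from the pasted mask rather than from the full foreground crop, the annotation stays tight around the object even when the mask is irregular or partly clipped by the image border.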