Add-SD: Rational Generation without Manual Reference

Lingfeng Yang, Xinyu Zhang, Xiang Li, Jinwen Chen, Kun Yao, Gang Zhang, Errui Ding, Lingqiao Liu, Jingdong Wang, Jian Yang

TL;DR

Add-SD introduces an instruction-based diffusion pipeline to insert objects into real scenes without manual layouts. By creating a RemovalDataset through object removal and fine-tuning a Stable Diffusion model accordingly, it learns rational object addition driven solely by text prompts. The approach further generates synthetic data for downstream tasks using super-label sampling and grounding-based localization, yielding improvements in rare-class LVIS detection and COCO performance. Empirical results, including user evaluations and quantitative metrics, demonstrate enhanced editing quality, background consistency, and task benefits, with scalable data augmentation potential. This framework reduces manual labeling costs while delivering diverse, plausible scene augmentations that bolster vision tasks with long-tail distributions.

Abstract

Diffusion models have exhibited remarkable prowess in visual generalization. Building on this success, we introduce an instruction-based object addition pipeline, named Add-SD, which automatically inserts objects into realistic scenes with rational sizes and positions. Unlike layout-conditioned methods, Add-SD is conditioned solely on simple text prompts rather than on costly human-provided references such as bounding boxes. Our work contributes in three aspects: proposing a dataset containing numerous instructed image pairs; fine-tuning a diffusion model for rational generation; and generating synthetic data to boost downstream tasks. The first aspect involves creating a RemovalDataset consisting of original-edited image pairs with textual instructions, where an object has been removed from the original image while maintaining strong pixel consistency in the background. These data pairs are then used to fine-tune the Stable Diffusion (SD) model. The resulting Add-SD model can then insert the requested objects into an image in a rational manner. Additionally, we generate synthetic instances for downstream task datasets at scale, particularly for tail classes, to alleviate the long-tailed problem. Downstream tasks benefit from the enriched dataset with enhanced diversity and rationality. Experiments on LVIS val demonstrate that Add-SD yields an improvement of 4.3 mAP on rare classes over the baseline. Code and models are available at https://github.com/ylingfeng/Add-SD.
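As a rough illustration of how one RemovalDataset entry could be organized, the sketch below pairs an object-removed image (model input) with the original image (training target) and a text instruction. The record fields, the `_removed` file-naming scheme, and the "add a {category}" prompt template are assumptions made for this example and are not taken from the paper or its code; the inpainting step that actually removes the object is not shown.

```python
import os
import random
from dataclasses import dataclass


@dataclass
class RemovalPair:
    source_image: str  # image with one object removed (model input)
    target_image: str  # original image that still contains the object
    instruction: str   # text prompt asking the model to add the object back


def make_removal_pair(image_path, categories, rng, removed_suffix="_removed"):
    """Describe the training pair obtained by removing one annotated instance.

    `categories` lists the category names of the instances annotated in the
    image. `removed_suffix` is a hypothetical naming scheme for the inpainted
    file; the removal itself is assumed to happen elsewhere.
    """
    if not categories:
        raise ValueError("image has no annotated instances to remove")
    category = rng.choice(categories)
    stem, ext = os.path.splitext(image_path)
    removed_path = f"{stem}{removed_suffix}{ext}"
    # Assumed prompt template, chosen only for illustration.
    return RemovalPair(removed_path, image_path, f"add a {category}")


pair = make_removal_pair("img/0001.jpg", ["dog"], random.Random(0))
```

Because the target is the untouched original image, everything outside the removed region is identical between the two images, which is the background consistency the abstract refers to.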

Paper Structure

This paper contains 17 sections, 2 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: The proposed Add-SD pipeline begins with the creation of a RemovalDataset containing image pairs via random instance removal. This dataset is then used to fine-tune the Stable Diffusion model for image-to-image generation. Next, objects are generated across the entire dataset, with rare classes sampled to alleviate the long-tail issue. Finally, the synthetic images are integrated into the original dataset to enhance downstream tasks.
  • Figure 2: Our pipeline uses an object removal operation, which ensures that image pairs have a consistent background.
  • Figure 3: Visualization of RemovalDataset. The first row shows the removal of a single object from COCO images. The second row involves removing partial instances to facilitate multiple generations. The LVIS dataset is similar to COCO but includes more fine-grained categories. The VG and RefCOCO datasets specify target captions containing attribute and relation information.
  • Figure 4: We design a super-label-based sampling strategy to restrict the category of the added object, ensuring rationality. Then, we randomly sample a sub-label within the super-label, assigning a higher weight to tail-class labels to alleviate the long-tail problem. After image generation, the annotations are inherited from the vanilla dataset (green) for the original instance and grounded (red) for the added instance.
  • Figure 5: Copy-Paste (Ghiasi et al., 2021) and X-Paste (Zhao et al., 2023) rely on complex synthetic data generation pipelines and may produce irrational augmentations. Our Add-SD synthetic augmentation is simple and effective.
  • ...and 5 more figures
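
The super-label sampling described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy taxonomy, the set of tail classes, the `tail_weight` value, and the choice to draw the super-label from categories already present in the image are all assumptions made for this example.

```python
import random

# Hypothetical toy taxonomy; the real super-labels and sub-labels would come
# from the dataset's own category hierarchy.
SUPER_LABELS = {
    "animal": ["dog", "cat", "ferret"],
    "vehicle": ["car", "bus", "unicycle"],
}
TAIL_CLASSES = {"ferret", "unicycle"}  # assumed rare classes
SUB_TO_SUPER = {sub: sup for sup, subs in SUPER_LABELS.items() for sub in subs}


def sample_added_label(existing_labels, rng, tail_weight=5.0):
    """Pick a category to add, restricted to super-labels seen in the image.

    Tail classes get `tail_weight` times the sampling weight of other classes.
    """
    supers = sorted({SUB_TO_SUPER[l] for l in existing_labels if l in SUB_TO_SUPER})
    if not supers:
        raise ValueError("no known super-label among the existing labels")
    super_label = rng.choice(supers)
    candidates = SUPER_LABELS[super_label]
    weights = [tail_weight if c in TAIL_CLASSES else 1.0 for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_added_label(["dog"], rng) for _ in range(2000)]
counts = {label: samples.count(label) for label in SUPER_LABELS["animal"]}
```

Restricting candidates to a super-label tied to the scene keeps the added object plausible for that image, while the per-class weights shift sampling toward rare categories.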