EraseDraw: Learning to Draw Step-by-Step via Erasing Objects from Images

Alper Canberk, Maksym Bondarenko, Ege Ozguroglu, Ruoshi Liu, Carl Vondrick

TL;DR

EraseDraw tackles the object insertion problem by turning erasing into a learning signal. It builds an automatic data-generation pipeline that erases objects from in-the-wild images and uses vision-language models to describe the removed regions, producing training triplets for a text-conditioned diffusion model. The model is fine-tuned from Stable Diffusion 1.5 on 65,000 automatically generated examples and supports step-by-step composition via beam search guided by CLIP, achieving state-of-the-art results on open-world insertion benchmarks. The approach promises more controllable, context-aware content creation while highlighting societal risks and the need for responsible deployment.

Abstract

Creative processes such as painting often involve creating different components of an image one by one. Can we build a computational model to perform this task? Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details. We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well. Inverting the direction of object removal, we obtain high-quality data for learning to insert objects that are spatially, physically, and optically consistent with the surroundings. With this scalable automatic data generation pipeline, we can create a dataset for learning object insertion, which is used to train our proposed text-conditioned diffusion model. Qualitative and quantitative experiments show that our model achieves state-of-the-art results in object insertion, particularly for in-the-wild images. We show compelling results on diverse insertion prompts and images across various domains. In addition, we automate iterative insertion by combining our insertion model with beam search guided by CLIP.
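
The iterative insertion mentioned at the end of the abstract can be illustrated with a minimal sketch of beam search over a sequence of insertion prompts. This is not the authors' implementation: `propose` stands in for sampling edited images from an insertion model, `score` stands in for a CLIP image-text similarity, and `beam_width`, `samples_per_beam`, and the additive cumulative score are all assumptions made for illustration. The toy stand-ins at the bottom use tuples of strings in place of images so the sketch runs without any model.

```python
from typing import Any, Callable, List, Sequence, Tuple


def beam_search_insertion(
    image: Any,
    prompts: Sequence[str],
    propose: Callable[[Any, str, int], List[Any]],
    score: Callable[[Any, str], float],
    beam_width: int = 2,
    samples_per_beam: int = 3,
) -> Tuple[float, Any]:
    """Insert the objects named in `prompts` one at a time, keeping the best beams.

    `propose(image, prompt, n)` is a hypothetical hook that returns n edited images;
    `score(image, prompt)` is a hypothetical hook returning a similarity (higher is better).
    """
    # Each beam is (cumulative score, current image).
    beams = [(0.0, image)]
    for prompt in prompts:
        candidates = []
        for total, current in beams:
            for edited in propose(current, prompt, samples_per_beam):
                candidates.append((total + score(edited, prompt), edited))
        if not candidates:
            raise ValueError(f"no candidates proposed for prompt {prompt!r}")
        # Keep only the highest-scoring candidates for the next insertion step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy stand-ins (assumptions): an "image" is a tuple of inserted-object tags.
def toy_propose(image, prompt, n):
    return [image + (f"{prompt}#{k}",) for k in range(n)]


def toy_score(image, prompt):
    # Pretend that a lower sample index means a better match to the prompt.
    return -float(image[-1].split("#")[1])


best_score, best_image = beam_search_insertion(
    (), ["cat", "hat"], toy_propose, toy_score
)
```

In a real setting, `propose` would call the insertion model several times with different random seeds and `score` would compare each result against the prompt; keeping more than one beam lets a later step recover from an early insertion that looked good in isolation.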

Paper Structure

This paper contains 36 sections, 1 equation, 17 figures, 2 tables, and 1 algorithm.

Figures (17)

  • Figure 1: EraseDraw. We leverage advancements in image understanding and inpainting to train a model that can insert an object given a language instruction.
  • Figure 2: State-of-the-art image editing methods fail to correctly insert objects into visual scenes. They perform global edits that don't preserve scene context (left) [brooks2023instructpix2pix], replace existing objects (middle) [su2023magicbrush], and struggle with spatial reasoning (right) [hive_magicbrush_checkpoint]. Our results on these examples are shown in the Appendix (figure label: fig:ours-on-failure).
  • Figure 3: EraseDraw Data Generation Pipeline. (i) An unlabeled image is sampled from a dataset. (ii) The image is given to a captioning model, which describes the objects in it. (iii) Objects are detected using the coarse caption from the captioning model, and the confidently detected objects are (iv) segmented and (v) erased. The final images are added to the dataset along with their corresponding captions.
  • Figure 4: We show examples from our EraseDraw Dataset.
  • Figure 5: Qualitative Results on EmuEdit Benchmark on inserting people and outdoor objects.
  • ...and 12 more figures
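
The data generation steps listed in the Figure 3 caption can be sketched as a small generator. This is only an illustration under stated assumptions, not the paper's code: the `caption`, `detect`, `segment`, and `erase` callables are hypothetical hooks for the captioning, detection, segmentation, and inpainting models, `min_confidence` is an invented threshold for "confidently detected", and the layout of each training example (erased image as input, original image as target, object description as prompt) is an assumption about what the training triplets contain.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List, Tuple


@dataclass
class InsertionExample:
    source: Any  # image with the object erased (assumed model input)
    target: Any  # original image containing the object (assumed model target)
    prompt: str  # description of the erased object


def generate_examples(
    images: Iterable[Any],
    caption: Callable[[Any], List[str]],
    detect: Callable[[Any, str], List[Tuple[Any, float]]],
    segment: Callable[[Any, Any], Any],
    erase: Callable[[Any, Any], Any],
    min_confidence: float = 0.5,  # hypothetical threshold
) -> Iterator[InsertionExample]:
    for image in images:                      # (i) sample an unlabeled image
        for name in caption(image):           # (ii) describe objects in the image
            for box, confidence in detect(image, name):  # (iii) detect from caption
                if confidence < min_confidence:
                    continue                  # keep only confident detections
                mask = segment(image, box)    # (iv) segment the object
                erased = erase(image, mask)   # (v) erase it from the image
                yield InsertionExample(source=erased, target=image, prompt=name)


# Toy stand-ins (assumptions): an "image" is a frozenset of object names.
confidences = {"dog": 0.9, "ball": 0.2}
examples = list(
    generate_examples(
        images=[frozenset({"dog", "ball"})],
        caption=lambda img: sorted(img),
        detect=lambda img, name: [(name, confidences[name])],
        segment=lambda img, box: box,
        erase=lambda img, mask: img - {mask},
    )
)
```

With the toy stand-ins, only the confidently detected "dog" yields an example; the low-confidence "ball" is skipped, which mirrors the filtering step in (iii).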