ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion

Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, Yedid Hoshen

TL;DR

This work proposes bootstrap supervision: an object removal model trained on a small counterfactual dataset is used to synthetically expand that dataset. The resulting approach significantly outperforms prior methods in photorealistic object removal and insertion, particularly at modeling the effects of objects on the scene.

Abstract

Diffusion models have revolutionized image editing but often generate images that violate physical laws, particularly the effects of objects on the scene, e.g., occlusions, shadows, and reflections. By analyzing the limitations of self-supervised approaches, we propose a practical solution centered on a "counterfactual" dataset. Our method involves capturing a scene before and after removing a single object, while minimizing other changes. By fine-tuning a diffusion model on this dataset, we are able to not only remove objects but also their effects on the scene. However, we find that applying this approach for photorealistic object insertion requires an impractically large dataset. To tackle this challenge, we propose bootstrap supervision; leveraging our object removal model trained on a small counterfactual dataset, we synthetically expand this dataset considerably. Our approach significantly outperforms prior methods in photorealistic object removal and insertion, particularly at modeling the effects of objects on the scene.
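
The bootstrap supervision idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `CounterfactualPair`, `bootstrap_pairs`, `training_schedule`, and the `remove_object` callable are hypothetical names introduced here, and the removal model is replaced by a trivial stand-in so the data flow can be run end to end. The sketch assumes only what the text states: a removal model trained on a small captured dataset is applied to unlabeled images to produce synthetic before/after pairs, and the insertion model is trained on the synthetic pairs first and then fine-tuned on the captured ones.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple

# Images and masks are kept abstract here; nested lists stand in for arrays.
Image = Sequence[Sequence[float]]
Mask = Sequence[Sequence[int]]


@dataclass(frozen=True)
class CounterfactualPair:
    """One training example: the same scene with and without the object."""
    with_object: Image
    without_object: Image
    mask: Mask
    synthetic: bool  # True if `without_object` was produced by the removal model


def bootstrap_pairs(
    images_and_masks: Iterable[Tuple[Image, Mask]],
    remove_object: Callable[[Image, Mask], Image],
) -> List[CounterfactualPair]:
    """Build synthetic pairs by running a removal model on unlabeled images."""
    pairs = []
    for image, mask in images_and_masks:
        # Hypothetical call: in practice this would be the fine-tuned diffusion model.
        background = remove_object(image, mask)
        pairs.append(CounterfactualPair(image, background, mask, synthetic=True))
    return pairs


def training_schedule(synthetic, captured):
    """Order the data in two stages: synthetic pairs first, captured pairs last."""
    return [("pretrain", list(synthetic)), ("finetune", list(captured))]


def zero_out(image: Image, mask: Mask) -> Image:
    """Stand-in for the removal model (assumption): blanks the masked pixels."""
    return [
        [0.0 if m else p for p, m in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    unlabeled = [([[1.0, 2.0], [3.0, 4.0]], [[0, 1], [0, 0]])]
    synthetic_pairs = bootstrap_pairs(unlabeled, zero_out)
    for stage, data in training_schedule(synthetic_pairs, captured=[]):
        print(stage, len(data))
```

For insertion training, each pair would be used in the reverse direction of removal: the object-free image serves as the background and the original image as the target. The stand-in `zero_out` only demonstrates the interface; it does not model shadows or reflections, which is exactly what the learned removal model is meant to handle.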

Paper Structure (27 sections, 7 equations, 16 figures, 5 tables)


Figures (16)

  • Figure 1: Object removal and insertion. Our method models the effect of an object on the scene including occlusions, reflections, and shadows, enabling photorealistic object removal and insertion. It significantly outperforms state-of-the-art baselines.
  • Figure 2: Generalization. Our counterfactual dataset is relatively small and was captured in controlled settings, yet the model generalizes exceptionally well to out-of-distribution scenarios such as removing buildings and large objects.
  • Figure 3: Overview of our method. We collect a counterfactual dataset consisting of photos of scenes before and after removing an object, while keeping everything else fixed. We use this dataset to fine-tune a diffusion model to remove an object and all its effects from the scene. For the task of object insertion, we bootstrap a bigger dataset by removing selected objects from a large unsupervised image dataset, resulting in a vast, synthetic counterfactual dataset. Training on this synthetic dataset and then fine-tuning on the smaller, original, supervised dataset yields a high-quality object insertion model.
  • Figure 4: Object removal - comparison with inpainting. Our model successfully removes the masked object, while the baseline inpainting model replaces it with a different one. Using a mask that covers the reflections (extended mask) may obscure important details from the model.
  • Figure 5: Object removal - comparison with general editing methods. We compare with two general editing methods, Emu Edit and MGIE. These methods often replace the object with a new one and introduce unintended changes to the input image. For this comparison, we used a text-based segmentation model to mask the object according to the instruction and passed the mask as input to our model.
  • ...and 11 more figures