ObjectStitch: Generative Object Compositing

Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, Daniel Aliaga

TL;DR

The paper tackles realistic object compositing by introducing ObjectStitch, a diffusion-based framework that unifies geometry, color harmonization, lighting, and shadow synthesis under a self-supervised regime. It introduces a content adaptor to translate a reference object into diffusion conditioning and a masked generator to blend the object into a background while preserving identity. A fully self-supervised pipeline using synthetic data and data augmentation enables training without manual labels, and a user study on real-world images shows higher realism and faithfulness than the baselines. The work demonstrates a practical, end-to-end solution for generative object compositing and outlines future work on improved appearance control and full-background synthesis.

Abstract

Object compositing based on 2D images is a challenging problem since it typically involves multiple processing stages such as color harmonization, geometry correction and shadow generation to generate realistic results. Furthermore, annotating training data pairs for compositing requires substantial manual effort from professionals, and is hardly scalable. Thus, with the recent advances in generative models, in this work, we propose a self-supervised framework for object compositing by leveraging the power of conditional diffusion models. Our framework can holistically address the object compositing task in a unified model, transforming the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling. To preserve the input object's characteristics, we introduce a content adaptor that helps to maintain categorical semantics and object appearance. A data augmentation method is further adopted to improve the fidelity of the generator. Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.

Paper Structure (22 sections, 5 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Example results of object compositing with (a) a copy-and-paste scheme, (b) a traditional compositing pipeline and (c) ours (ObjectStitch). The traditional compositing pipeline uses the best available off-the-shelf models, including foreground/background color harmonization [tsai2017deep, jiang2021ssh], Poisson blending [perez2003poisson], and shadow synthesis [sheng2022controllable]. ObjectStitch achieves more realistic results and addresses geometry correction, harmonization, shadow generation, and view synthesis in a single model while preserving an appearance similar to the reference object.
  • Figure 2: System pipeline. Our framework consists of a content adaptor and a generator (a pretrained text-to-image diffusion model). The input image $I_o$ is fed into a ViT and the adaptor, which produces a descriptive embedding. At the same time, the background image $I_{bg}$ is taken as input by the diffusion model. At each iteration during the denoising stage, we apply the mask $M$ on the generated image $I_{out}$, so that the generator only denoises the masked area $I_{out} \otimes M$.
  • Figure 3: Structure of the Content Adaptor. In the first stage, it is trained on a large dataset of image-caption pairs to learn multi-modal sequential embeddings containing high-level semantics. In the second stage, it is fine-tuned under the diffusion framework to learn to encode identity features in adaptive embedding.
  • Figure 4: Illustration of our synthetic data generation and data augmentation scheme. The top row shows the data generation process, including perspective warping, random rotation, and random color shifting. The original image is used as both the input background and the ground truth, while the perturbed object is fed into the adaptor. The bottom row shows crop and shift augmentations, which help to improve the generation quality and preserve object details.
  • Figure 5: User study results. We conduct side-by-side comparisons between our method and one baseline method at a time to quantify the generation quality in terms of realism and appearance preservation. The results show that our method outperforms the baselines.
  • ...and 6 more figures
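
The masking step described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: `toy_denoise_step` is a hypothetical stand-in for the diffusion model's denoising step, `target` is a made-up "clean" image it moves toward, and re-imposing the un-noised background outside the mask at every iteration is a simplification chosen for clarity. The only point it shows is that the update is kept inside $M$ while everything outside $M$ stays equal to $I_{bg}$.

```python
import numpy as np


def toy_denoise_step(x, target, rate=0.5):
    """Hypothetical stand-in for one denoising step: move x toward a target."""
    return x + rate * (target - x)


def masked_denoise(background, mask, target, steps=10, seed=0):
    """Iteratively update only the masked region, keeping the background fixed.

    background: array playing the role of I_bg
    mask:       array of 0/1 values playing the role of M (1 = region to generate)
    target:     assumed clean image the toy denoiser converges to
    """
    rng = np.random.default_rng(seed)
    # Start from noise inside the mask and the known background elsewhere.
    x = mask * rng.standard_normal(background.shape) + (1.0 - mask) * background
    for _ in range(steps):
        x_new = toy_denoise_step(x, target)
        # Keep the update where mask == 1; restore the background where mask == 0.
        x = mask * x_new + (1.0 - mask) * background
    return x


# Tiny example: a 4x4 "image" with a 2x2 masked region in the center.
background = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
target = np.ones((4, 4))

out = masked_denoise(background, mask, target)
```

In this toy run the pixels outside the mask remain identical to `background`, while the pixels inside the mask approach `target`. A real diffusion sampler would replace `toy_denoise_step` with the conditioned model and its noise schedule.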