Thinking Outside the BBox: Unconstrained Generative Object Compositing

Gemma Canet Tarrés; Zhe Lin; Zhifei Zhang; Jianming Zhang; Yizhi Song; Dan Ruta; Andrew Gilbert; John Collomosse; Soo Ye Kim

TL;DR

This first-of-its-kind model generates object effects such as shadows and reflections that extend beyond the mask, enhancing image realism, and it outperforms existing object placement and compositing models on various quality metrics and in user studies.

Abstract

Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image compositing methods leverage diffusion models to handle multiple sub-tasks at once. However, existing models face limitations due to their reliance on masking the original object during training, which constrains their generation to the input mask. Furthermore, obtaining an accurate input mask specifying the location and scale of the object in a new image can be highly challenging. To overcome such limitations, we define a novel problem of unconstrained generative object compositing, i.e., the generation is not bounded by the mask, and train a diffusion-based model on a synthesized paired dataset. Our first-of-its-kind model is able to generate object effects such as shadows and reflections that go beyond the mask, enhancing image realism. Additionally, if an empty mask is provided, our model automatically places the object in diverse natural locations and scales, accelerating the compositing workflow. Our model outperforms existing object placement and compositing models in various quality metrics and user studies.

Paper Structure (25 sections, 3 equations, 32 figures, 6 tables)

Introduction
Related Work
Methodology
Data Generation Pipeline
Model Architecture
Training Strategy
Experiments
Comparison to Existing Methods
Effect of Each Training Stage
Applications
Limitations
Conclusion
Unconstrained Image Compositing
Data Generation
Experiments
...and 10 more sections

Figures (32)

Figure 1: Our unconstrained object compositing model has various advantages. When using a bbox (bottom), our model achieves better background preservation (see bird in background) and more natural shadows and reflections than SotA models [song2022objectstitch, yang2023paintbyexample, lu2023tficon, chen2023anydoor, zhang2023controlcom] by allowing generation beyond the bbox. Without any bbox input (top), our model can automatically place and composite objects in diverse ways.
Figure 1: Visual comparison of our model against ObjectStitch [song2022objectstitch], exemplifying the main benefits of introducing unconstrained image compositing as a novel task. (a-c) Different perturbations of the same bbox (red) lead to unnatural results in ObjectStitch; (d-e) our model can produce more realistic compositions by allowing shadows/reflections beyond the bbox; (f) because prior models mask the bbox region of the background during training, they can create visible changes in the background surrounding the object, while our model ensures better preservation.
Figure 2: Visualization of the steps for synthesizing background images. Ground Truth corresponds to original images from Pixabay. Our pipeline can be applied to any image.
Figure 2: Visualization of the reflection mask in the data generation pipeline. After obtaining each object mask (2nd column), the last two columns show the inpainting mask if the reflection is obtained by flipping the mask right underneath the object (3rd column) or using the axis computed in Eq. \ref{eq:refl} (4th column).
Figure 3: Model architecture. Our model consists of: (i) an object encoder $\mathcal{E}$ and a content adaptor $\mathcal{A}$ that encode the object at different scales; (ii) a Stable Diffusion backbone comprised of an autoencoder ($\mathcal{G}$, $\mathcal{D}$) and a U-Net. The multiscale embeddings from (i) are averaged to condition the U-Net via cross-attention. Background image $\mathcal{I}_{BG}$ and a mask $\mathcal{I}_{p}$ are concatenated to the input of (ii). $\mathcal{I}_{p}$ can be empty by setting all values to $-1$. The U-Net is adapted to return the predicted mask $\mathcal{I}'_m$ as an additional output.
...and 27 more figures
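
The conditioning described in the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the channel counts, the assumption that all multiscale embeddings share one shape, the use of a latent-space background, and the value `1.0` inside the bbox are all hypothetical. The only details taken from the caption are that the multiscale embeddings are averaged, that the background and mask are concatenated to the backbone input, and that an empty mask has every value set to $-1$.

```python
import numpy as np


def average_multiscale_embeddings(embeddings):
    """Average per-scale object embeddings into one conditioning tensor.

    Assumption: every embedding has the same (tokens, dim) shape.
    """
    return np.stack(embeddings, axis=0).mean(axis=0)


def make_position_mask(height, width, bbox=None):
    """Build the mask I_p as a (1, H, W) array.

    With bbox=None the mask is "empty": every value is -1, as stated in
    the caption. The 1.0 fill inside the bbox is an assumption made here
    purely for illustration.
    """
    mask = np.full((1, height, width), -1.0, dtype=np.float32)
    if bbox is not None:
        top, left, bottom, right = bbox
        mask[:, top:bottom, left:right] = 1.0
    return mask


def build_unet_input(noisy_latent, background, mask):
    """Concatenate the noisy latent, background, and mask along channels.

    Assumption: all three arrays are channel-first and share H and W.
    """
    return np.concatenate([noisy_latent, background, mask], axis=0)


# Hypothetical shapes: 4-channel latents on a 64x64 grid.
noisy = np.zeros((4, 64, 64), dtype=np.float32)
background = np.zeros((4, 64, 64), dtype=np.float32)

# No bbox given: the model is expected to choose location and scale.
empty_mask = make_position_mask(64, 64)
unet_input = build_unet_input(noisy, background, empty_mask)

# Cross-attention context from three hypothetical scales.
context = average_multiscale_embeddings(
    [np.random.rand(77, 768).astype(np.float32) for _ in range(3)]
)
```

Passing a bbox such as `(16, 16, 48, 48)` to `make_position_mask` would give the constrained variant of the same input, which is why a single input layout can serve both the bbox and bbox-free modes shown in Figure 1.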
