ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion

Zitian Zhang, Frédéric Fortier-Chouinard, Mathieu Garon, Anand Bhattad, Jean-François Lalonde

TL;DR

ZeroComp addresses realistic 3D object compositing without paired training data by combining intrinsic image decomposition with a diffusion-based neural renderer conditioned on intrinsic maps via ControlNet. Trained on synthetic intrinsics from OpenRooms, it learns relighting and shadow generation, enabling zero-shot insertion of 3D objects into real scenes and extending to outdoor and 2D-object scenarios. A purpose-built test dataset and a comprehensive evaluation, including human perceptual studies, show that ZeroComp matches or surpasses traditional lighting-estimation and Stable Diffusion-based baselines in realism while preserving object identity and pose. The work highlights the practicality of zero-shot compositing for editing and VFX, and discusses limitations tied to the quality of intrinsic-map estimation as well as possible extensions to broader materials, outdoor scenes, and real-world data.

Abstract

We present ZeroComp, an effective zero-shot 3D object compositing approach that does not require paired composite-scene images during training. Our method leverages ControlNet to condition from intrinsic images and combines it with a Stable Diffusion model to utilize its scene priors, together operating as an effective rendering engine. During training, ZeroComp uses intrinsic images based on geometry, albedo, and masked shading, all without the need for paired images of scenes with and without composite objects. Once trained, it seamlessly integrates virtual 3D objects into scenes, adjusting shading to create realistic composites. We developed a high-quality evaluation dataset and demonstrate that ZeroComp outperforms methods using explicit lighting estimations and generative techniques in quantitative and human perception benchmarks. Additionally, ZeroComp extends to real and outdoor image compositing, even when trained solely on synthetic indoor data, showcasing its effectiveness in image compositing.

Paper Structure

This paper contains 14 sections, 4 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: From (a) a target background image and (b) available intrinsic maps (depth, normals, albedo) rendered from a 3D model, our method ZeroComp generates (c) a realistic composite, without access to the scene geometry or lighting, and without being trained specifically for object compositing. ZeroComp realistically shades the object and adds a compelling shadow.
  • Figure 2: Overview of our zero-shot intrinsic compositing pipeline. The input background image $x_\mathrm{bg}$ (top-left) is first converted to intrinsic layers $\mathbf{i}_\mathrm{bg}$ using specialized networks (top, in yellow). In parallel, the corresponding intrinsic layers of the 3D object $\mathbf{i}_\mathrm{obj}$---except the shading---are rendered using a graphics engine (middle, in blue). Layers are then composited together to obtain the composited intrinsics $\mathbf{i}_\mathrm{comp}$ (bottom, in green). From this, our trained ZeroComp renders the final composite $x$ (top-right).
  • Figure 3: Overview of different components from our full compositing equation in \ref{eqn:comp-final}. For (a) a given target background image $x_\mathrm{bg}$, diffusion models can create artifacts when rendering (b) background $f_\theta(\mathbf{i}_\mathrm{bg})$ and (c) composite $f_\theta(\mathbf{i}_\mathrm{comp})$ intrinsics. To alleviate this, we compute (d) the shadow opacity ratio of predictions $R$ and, together with (e) the object mask $m$, we can create (f) the final artifacts-free composite $x$. Please see the insets (top-right of each column) for a zoomed-in view of the artifacts created.
  • Figure 4: Qualitative comparison with lighting estimation and image-based methods. Results are sorted from worst (top) to best (bottom) PSNR for "Ours". Please zoom in and refer to the supplementary material for additional images and methods.
  • Figure 5: Effect of the shading mask radius $\lambda$. Generated images (top) and their associated masked shading maps (bottom) are shown. A small radius ($\lambda=0.5$) results in unrealistic shadow shapes, while a large radius ($\lambda=2.0$) produces overly large shadows and a loss of shading detail in the scene.
  • ...and 4 more figures
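
The compositing steps described in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. The function names `composite_intrinsics` and `final_composite` are hypothetical, and the exact form of the final equation is an assumption inferred from the Figure 3 caption: the ratio $R$ is taken as the composite render divided by the background render, the background image is darkened by $R$ outside the object mask $m$, and the rendered object is pasted inside it. The renderer $f_\theta$ is not modeled here; its outputs are passed in as arrays.

```python
import numpy as np


def composite_intrinsics(i_bg, i_obj, m):
    """Paste the object's intrinsic layers over the background's.

    i_bg, i_obj: float arrays of shape (H, W, C); m: object mask of shape (H, W).
    """
    m = m[..., None]  # broadcast the mask over the channel axis
    return m * i_obj + (1.0 - m) * i_bg


def final_composite(x_bg, render_bg, render_comp, m, eps=1e-6):
    """Assumed form of the final composite (inferred from the Figure 3 caption).

    x_bg:        original background image, shape (H, W, 3)
    render_bg:   renderer output for the background intrinsics, f(i_bg)
    render_comp: renderer output for the composited intrinsics, f(i_comp)
    m:           object mask, shape (H, W), 1 inside the object
    """
    # Shadow opacity ratio: how much the inserted object darkens each pixel.
    # eps guards against division by zero (an illustrative choice).
    ratio = render_comp / np.maximum(render_bg, eps)
    m = m[..., None]
    # Inside the mask: keep the rendered object.
    # Outside the mask: keep the original background, scaled by the ratio.
    return m * render_comp + (1.0 - m) * ratio * x_bg
```

Under this assumption, pixels that the renderer leaves unchanged keep their original background value exactly, so any artifacts the diffusion model introduces in both renders cancel in the ratio, while pixels it darkens receive a proportional shadow.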