Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, Zian Wang

TL;DR

DiPIR tackles photorealistic virtual object insertion into single images by jointly estimating scene lighting and tone-mapping within a diffusion-guided, physically based inverse rendering framework. It couples a differentiable path-traced renderer with a personalized diffusion model through an adaptive diffusion guidance (LDS) loss, and introduces a two-stage environment-map fusion to recover high-frequency lighting cues and accurate shadows. The approach uses lightweight LoRA-based diffusion personalization and spherical Gaussian (SG) lighting to enable end-to-end optimization of lighting, shadowing, and tone curves, and it applies to indoor and outdoor scenes as well as videos. Experiments on Waymo and PolyHaven show improved realism and robustness over strong baselines, with ablations validating the contributions of personalization, tone-mapping, and environment-map fusion.

Abstract

The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.
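
The TL;DR mentions SG-based lighting. Assuming SG stands for spherical Gaussians, as is usual in inverse rendering, the sketch below shows how such an environment map could be evaluated: each lobe contributes `amplitude * exp(sharpness * (dot(direction, axis) - 1))`. The function name `sg_radiance`, the two example lobes, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sg_radiance(directions, axes, sharpness, amplitudes):
    """Evaluate a sum of spherical Gaussian lobes.

    directions: (N, 3) unit query directions
    axes:       (K, 3) unit lobe axes
    sharpness:  (K,)   lobe sharpness (larger = narrower lobe)
    amplitudes: (K, 3) RGB amplitude of each lobe
    returns:    (N, 3) RGB radiance per query direction
    """
    cosines = directions @ axes.T                           # (N, K)
    weights = np.exp(sharpness[None, :] * (cosines - 1.0))  # (N, K)
    return weights @ amplitudes                             # (N, 3)

# Hypothetical two-lobe environment: a narrow bright "sun" and a broad dim "sky".
axes = np.array([[0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
sharpness = np.array([200.0, 1.0])
amplitudes = np.array([[10.0, 9.0, 8.0],
                       [0.2, 0.3, 0.5]])

# Query the environment along the sun axis and in the opposite direction.
queries = np.array([[0.0, 0.0, 1.0],
                    [0.0, 0.0, -1.0]])
radiance = sg_radiance(queries, axes, sharpness, amplitudes)
```

Because every lobe is a smooth function of its axis, sharpness, and amplitude, all three can receive gradients, which is what makes a representation like this convenient for optimization-based lighting recovery.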

Paper Structure (59 sections, 16 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: We propose DiPIR, a physically based method to recover lighting from a single image, enabling arbitrary virtual object compositing into indoor and outdoor scenes, as well as material and tone-mapping optimization. Project page: https://research.nvidia.com/labs/toronto-ai/DiPIR/
  • Figure 2: Method overview. Given an input image, we first construct a virtual 3D scene with a virtual object and proxy plane. Our physically-based renderer then differentiably simulates the interactions of the optimizable environment map with the inserted virtual object and its effect on the background scene (shadowing) (left). At each iteration, the rendered image is diffused and passed through a personalized diffusion model (middle). The gradient of the adapted Score Distillation formulation is propagated back to the environment map and the tone-mapping curve through the differentiable renderer. Upon convergence, we recover lighting and tone-mapping parameters, which allow photorealistic compositing of virtual objects from a single image (right).
  • Figure 3: Personalization with concept preservation.
  • Figure 4: Ablation study on outdoor driving scenes from the Waymo Open Dataset (Sun et al., 2020). We report the percentage of images for which users preferred DiPIR over its ablated versions. The full pipeline is preferred more often than each of its ablated versions.
  • Figure 5: Our physically based inverse rendering pipeline unlocks further applications such as material, local emission and tone-mapping refinement.
  • ...and 4 more figures
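
The loop described in the Figure 2 caption (render, add noise, query a diffusion model, propagate the gradient back to the lighting) can be illustrated with a toy sketch. This is not the paper's adapted formulation: it uses a plain score-distillation-style gradient, a one-line "renderer" without shadows or tone-mapping, and an analytic Gaussian prior standing in for the personalized diffusion model. All names (`render`, `predict_noise`, `optimize_light`) and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy scene: per-pixel albedo lit by a single RGB light,
# standing in for the optimizable environment map.
albedo = rng.uniform(0.2, 0.9, size=(32, 32, 3))
true_light = np.array([1.0, 0.8, 0.6])
target = albedo * true_light   # mean of the toy "realistic image" prior
prior_std = 0.1                # spread of that prior around the mean

def render(light):
    """Toy differentiable renderer: image = albedo * light."""
    return albedo * light

def predict_noise(noisy, alpha, sigma):
    """Exact noise prediction for the Gaussian prior N(target, prior_std^2 I).
    Stand-in for a diffusion model; a real one would be a neural network."""
    return sigma * (noisy - alpha * target) / (alpha**2 * prior_std**2 + sigma**2)

def optimize_light(steps=300, lr=0.3):
    light = np.array([0.2, 0.2, 0.2])  # deliberately wrong initial lighting
    for _ in range(steps):
        image = render(light)
        sigma = rng.uniform(0.2, 0.98)          # random noise level
        alpha = np.sqrt(1.0 - sigma**2)
        noise = rng.standard_normal(image.shape)
        noisy = alpha * image + sigma * noise   # "diffuse" the rendering
        # Score-distillation-style gradient with respect to the image.
        residual = predict_noise(noisy, alpha, sigma) - noise
        # Chain rule through the renderer: d(image)/d(light) = albedo.
        # The mean over pixels is a rescaled sum, absorbed into lr.
        grad_light = (residual * albedo).mean(axis=(0, 1))
        light = light - lr * grad_light
    return light

recovered_light = optimize_light()
```

In this toy setting the recovered light moves toward `true_light`, because the prior is centered on the correctly lit image. In a pipeline like the one in Figure 2, the same gradient would instead flow through a path tracer into the environment map and tone-mapping parameters.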