DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models

Ruofan Liang, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang

TL;DR

DiffusionRenderer presents a unified neural framework for both inverse and forward rendering using video diffusion priors to estimate G-buffers from real-world videos and to synthesize photorealistic imagery under target lighting. It employs two coupled video diffusion models—an inverse renderer that predicts per-pixel geometry and material buffers, and a forward renderer that generates images conditioned on these buffers and environment lighting via cross-attention—trained on a large synthetic dataset and real-world auto-labeled data. The approach achieves state-of-the-art performance across forward rendering, inverse rendering, and relighting, and enables practical video editing tasks such as relighting, material editing, and object insertion from a single video. By avoiding explicit 3D geometry and path tracing, DiffusionRenderer demonstrates robust generalization and temporal consistency, illustrating the viability of data-driven neural rendering with diffusion priors for real-world applications.

Abstract

Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations--explicit 3D geometry, high-quality material properties, and lighting conditions--that are often impractical to obtain in real-world scenarios. Therefore, we introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks, and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forward rendering, consistently outperforming the state of the art. Our model enables practical applications from a single video input--including relighting, material editing, and realistic object insertion.

Paper Structure

This paper contains 19 sections, 8 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: We present DiffusionRenderer, a general-purpose method for both neural inverse and forward rendering. From input images or videos, it accurately estimates geometry and material buffers, and generates photorealistic images under specified lighting conditions, offering fundamental tools for image editing applications.
  • Figure 2: Classic PBR relies on explicit 3D geometry, e.g., meshes. When it is not available, screen space ray tracing (SSRT) struggles to accurately represent shadows and reflections (top). PBR is also sensitive to errors in G-buffers -- SSRT with estimated G-buffers from inverse rendering models often fails to deliver quality results (bottom). DiffusionRenderer bypasses these issues, producing photorealistic results without 3D geometry or perfect G-buffers.
  • Figure 3: Method overview. Given an input video, the neural inverse renderer estimates geometry and material properties per pixel. It generates one scene attribute at a time, with the domain embedding indicating the target attributes to generate (Sec. \ref{sec:neural_inverse_rendering}). Conversely, the neural forward renderer produces photorealistic images given lighting information, geometry, and material buffers. The lighting condition is injected into the base video diffusion model through cross-attention layers (Sec. \ref{sec:neural_rendering}). During joint training with both synthetic and real data, we use an optimizable LoRA for real data sources (Sec. \ref{sec:training}).
  • Figure 4: Qualitative comparison of forward rendering. Our method generates high-quality inter-reflections (top) and shadows (bottom), producing more accurate results than the neural baselines.
  • Figure 5: Qualitative comparison of inverse rendering. We compare with RGB$\leftrightarrow$X \cite{zeng2024rgb} on the DL3DV10k dataset. Both methods work well on indoor scenes, while our method predicts finer details in thin structures and more accurate metallic and roughness channels (top), likely benefiting from our curated training data. Compared to RGB$\leftrightarrow$X, our method also generalizes better to outdoor scenes (bottom row).
  • ...and 8 more figures
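
The Figure 3 caption says that lighting is injected into the forward renderer through cross-attention layers. As a rough illustration of that mechanism only, the sketch below implements single-head scaled dot-product cross-attention in plain Python, with queries taken from video tokens and keys/values taken from lighting tokens. The function names, token shapes, and toy values are all assumptions made for this example; learned projections, multiple heads, and everything else about the paper's actual architecture are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: one d-dimensional vector per video token (hypothetical).
    keys, values: one vector per lighting token (hypothetical).
    Returns one output vector per query, each a weighted mix of `values`.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this video token to every lighting token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Blend the lighting values according to the attention weights.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


# Toy inputs: three video tokens attend over two lighting tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
light_keys = [[1.0, 0.0], [0.0, 1.0]]
light_values = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]

conditioned = cross_attention(video_tokens, light_keys, light_values)
```

Each output row is a convex combination of the lighting value vectors, so a video token that matches one lighting key more closely draws more of that key's value. In a real model the queries, keys, and values would come from learned linear projections of the latent video features and an encoded environment map.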