DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models
Ruofan Liang, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang
TL;DR
DiffusionRenderer is a unified neural framework for inverse and forward rendering that uses video diffusion priors to estimate G-buffers from real-world videos and to synthesize photorealistic imagery under target lighting. It employs two coupled video diffusion models: an inverse renderer that predicts per-pixel geometry and material buffers, and a forward renderer that generates images conditioned on these buffers and on environment lighting via cross-attention. The models are trained on a large synthetic dataset and real-world auto-labeled data. The approach achieves state-of-the-art performance across forward rendering, inverse rendering, and relighting, and enables practical video editing tasks such as relighting, material editing, and object insertion from a single video. By avoiding explicit 3D geometry and path tracing, DiffusionRenderer demonstrates robust generalization and temporal consistency, illustrating the viability of data-driven neural rendering with diffusion priors for real-world applications.
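To make the cross-attention conditioning mentioned above more concrete, the following is a minimal NumPy sketch of single-head cross-attention in which tokens derived from G-buffers act as queries and tokens derived from an environment map act as keys and values. This is a generic illustration and not the authors' architecture: the function names, token counts, feature sizes, and random weights are all assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(pixel_tokens, light_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, toy scale).

    pixel_tokens: (N, d_in) features standing in for G-buffer latents.
    light_tokens: (M, d_in) features standing in for environment-map latents.
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    q = pixel_tokens @ w_q  # queries come from the image side
    k = light_tokens @ w_k  # keys come from the lighting side
    v = light_tokens @ w_v  # values come from the lighting side
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy sizes and random weights: assumptions for illustration only.
rng = np.random.default_rng(0)
n_pixels, n_lights, d_in, d = 16, 8, 12, 4
pixel_tokens = rng.normal(size=(n_pixels, d_in))
light_tokens = rng.normal(size=(n_lights, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d)) for _ in range(3))

out, weights = cross_attention(pixel_tokens, light_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over lighting tokens, so every pixel token receives a weighted mix of lighting features. In a real model the projections would be learned and the attention would sit inside the diffusion network's layers.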
Abstract
Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates light transport, but relies on precise scene representations (explicit 3D geometry, high-quality material properties, and lighting conditions) that are often impractical to obtain in real-world scenarios. Therefore, we introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forward rendering, consistently outperforming the state-of-the-art. Our model enables practical applications from a single video input, including relighting, material editing, and realistic object insertion.
