RGB$\leftrightarrow$X: Image decomposition and synthesis using material- and lighting-aware diffusion models
Zheng Zeng, Valentin Deschaintre, Iliyan Georgiev, Yannick Hold-Geoffroy, Yiwei Hu, Fujun Luan, Ling-Qi Yan, Miloš Hašan
TL;DR
The paper bridges inverse and forward rendering with a unified diffusion-based framework for RGB$\rightarrow$X intrinsic decomposition and X$\rightarrow$RGB image synthesis in interior scenes. Its RGB$\rightarrow$X model estimates per-pixel intrinsic channels: normal $\mathbf{n}$, albedo $\mathbf{a}$, roughness $\mathbf{r}$, metallicity $\mathbf{m}$, and lighting in the form of irradiance $\mathbf{E}$. Its X$\rightarrow$RGB model synthesizes realistic images from a full or partial set of these channels, optionally guided by a text prompt. The key methodological device is channel dropout during conditioning: because the model learns to work with any subset of channels, it can be trained on heterogeneous datasets that provide different channels and can run at inference time with incomplete channel information. Low-resolution lighting is provided as a hint, and inpainting is supported for local edits. The results show quantitative and qualitative improvements over prior intrinsic-estimation methods and demonstrate realistic rendering, material replacement, and object insertion, which points to practical uses in editing, relighting, and content creation for indoor scenes. The work lays the groundwork for unified diffusion-based pipelines that couple analysis and synthesis.
Abstract
The three areas of realistic forward rendering, per-pixel inverse rendering, and generative image synthesis may seem like separate and unrelated sub-fields of graphics and vision. However, recent work has demonstrated improved estimation of per-pixel intrinsic channels (albedo, roughness, metallicity) based on a diffusion architecture; we call this the RGB$\rightarrow$X problem. We further show that the reverse problem of synthesizing realistic images given intrinsic channels, X$\rightarrow$RGB, can also be addressed in a diffusion framework. Focusing on the image domain of interior scenes, we introduce an improved diffusion model for RGB$\rightarrow$X, which also estimates lighting, as well as the first diffusion X$\rightarrow$RGB model capable of synthesizing realistic images from (full or partial) intrinsic channels. Our X$\rightarrow$RGB model explores a middle ground between traditional rendering and generative models: we can specify only certain appearance properties that should be followed, and give freedom to the model to hallucinate a plausible version of the rest. This flexibility makes it possible to use a mix of heterogeneous training datasets, which differ in the available channels. We use multiple existing datasets and extend them with our own synthetic and real data, resulting in a model capable of extracting scene properties better than previous work and of generating highly realistic images of interior scenes.
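To make the idea of conditioning on full or partial intrinsic channels concrete, the following is a minimal sketch of how a conditioning input could be assembled when datasets differ in the channels they provide. It is not the authors' implementation: the channel names and sizes in `CHANNEL_DIMS`, the function `build_condition`, the `p_drop` parameter, and the choice to zero-fill missing or dropped channels are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical channel layout (an assumption): name -> number of image planes.
CHANNEL_DIMS = {
    "normal": 3,
    "albedo": 3,
    "roughness": 1,
    "metallicity": 1,
    "irradiance": 3,
}


def build_condition(channels, height, width, p_drop=0.0, rng=None):
    """Stack intrinsic channels into one conditioning array.

    channels: dict mapping names from CHANNEL_DIMS to (H, W, C) arrays.
              A channel the dataset lacks is simply omitted (or set to None).
    p_drop:   probability of randomly dropping an available channel,
              intended for training time only.
    Returns (cond, kept): cond has shape (H, W, total planes); kept records
    which channels actually contributed data.
    """
    if rng is None:
        rng = np.random.default_rng()
    planes, kept = [], {}
    for name, dim in CHANNEL_DIMS.items():
        data = channels.get(name)
        # A channel is used only if it exists and survives the random drop.
        use = data is not None and rng.random() >= p_drop
        kept[name] = use
        if use:
            plane = np.asarray(data, dtype=np.float32).reshape(height, width, dim)
        else:
            # Assumption: absent or dropped channels are replaced by zeros.
            plane = np.zeros((height, width, dim), dtype=np.float32)
        planes.append(plane)
    return np.concatenate(planes, axis=-1), kept


# Usage: a dataset that only provides albedo and roughness.
h, w = 4, 5
sample = {
    "albedo": np.ones((h, w, 3)),
    "roughness": np.full((h, w, 1), 0.5),
}
cond, kept = build_condition(sample, h, w, p_drop=0.0, rng=np.random.default_rng(0))
```

Under these assumptions, the same code path serves both purposes described above: omitting channels handles datasets that lack them, and a nonzero `p_drop` during training exposes the model to partial inputs so that it can fill in the unspecified properties at inference time.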
