RGB$\leftrightarrow$X: Image decomposition and synthesis using material- and lighting-aware diffusion models

Zheng Zeng, Valentin Deschaintre, Iliyan Georgiev, Yannick Hold-Geoffroy, Yiwei Hu, Fujun Luan, Ling-Qi Yan, Miloš Hašan

TL;DR

The paper bridges inverse and forward rendering with a unified diffusion-based framework for RGB→X intrinsic decomposition and X→RGB image synthesis in interior scenes. Its RGB→X model estimates per-pixel intrinsic channels: normal $\mathbf{n}$, albedo $\mathbf{a}$, roughness $\mathbf{r}$, metallicity $\mathbf{m}$, and irradiance (lighting) $\mathbf{E}$. Its X→RGB model synthesizes realistic images from full or partial channels, optionally guided by text prompts. A key methodological choice is channel dropout in the conditioning, which allows training on heterogeneous datasets that provide different channel subsets and allows inference with incomplete channel information; low-resolution lighting is provided as a hint, and inpainting is supported for local edits. The results show quantitative and qualitative improvements over prior intrinsic-estimation methods and demonstrate realistic rendering, material replacement, and object insertion, which points to practical uses in editing, relighting, and content creation for indoor scenes. The work lays groundwork for unified diffusion-based pipelines that couple analysis and synthesis.

Abstract

The three areas of realistic forward rendering, per-pixel inverse rendering, and generative image synthesis may seem like separate and unrelated sub-fields of graphics and vision. However, recent work has demonstrated improved estimation of per-pixel intrinsic channels (albedo, roughness, metallicity) based on a diffusion architecture; we call this the RGB$\rightarrow$X problem. We further show that the reverse problem of synthesizing realistic images given intrinsic channels, X$\rightarrow$RGB, can also be addressed in a diffusion framework. Focusing on the image domain of interior scenes, we introduce an improved diffusion model for RGB$\rightarrow$X, which also estimates lighting, as well as the first diffusion X$\rightarrow$RGB model capable of synthesizing realistic images from (full or partial) intrinsic channels. Our X$\rightarrow$RGB model explores a middle ground between traditional rendering and generative models: we can specify only certain appearance properties that should be followed, and give freedom to the model to hallucinate a plausible version of the rest. This flexibility makes it possible to use a mix of heterogeneous training datasets, which differ in the available channels. We use multiple existing datasets and extend them with our own synthetic and real data, resulting in a model capable of extracting scene properties better than previous work and of generating highly realistic images of interior scenes.
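To illustrate how optional conditioning lets datasets with different available channels be mixed, here is a minimal NumPy sketch of channel dropout. It is not the authors' implementation: the function name `drop_channels`, the dropout probability `p_drop`, the uniform three-channel image shape, and the choice to replace a missing or dropped channel with zeros are all assumptions made for illustration.

```python
import numpy as np

# Channel names follow the summary above; everything else is hypothetical.
CHANNELS = ("normal", "albedo", "roughness", "metallicity", "irradiance")


def drop_channels(sample, rng, p_drop=0.5, shape=(64, 64, 3)):
    """Return (conditions, present) for one training sample.

    `sample` maps channel names to arrays; channels the dataset does not
    provide are simply absent. Each available channel is kept with
    probability 1 - p_drop. Missing or dropped channels become zeros
    (an assumed placeholder), and `present` records which were kept.
    """
    conditions, present = {}, {}
    for name in CHANNELS:
        available = sample.get(name) is not None
        keep = available and rng.random() >= p_drop
        conditions[name] = (
            sample[name] if keep else np.zeros(shape, dtype=np.float32)
        )
        present[name] = bool(keep)
    return conditions, present


# A dataset that only provides normal and albedo maps.
rng = np.random.default_rng(0)
sample = {
    "normal": np.ones((64, 64, 3), dtype=np.float32),
    "albedo": np.ones((64, 64, 3), dtype=np.float32),
}
conditions, present = drop_channels(sample, rng, p_drop=0.5)
```

Under this scheme a sample from a dataset that lacks, say, roughness looks the same to the model as a sample whose roughness was randomly dropped, which is one plausible reason the same mechanism also supports inference from partial channels.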

Paper Structure (39 sections, 7 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: High-level overview of our two diffusion models. Left: The RGB$\rightarrow$X model takes the input image, encoded into latent space by the pre-trained encoder, concatenated with the diffusion latent. We repurpose the text prompt to a switch choosing the desired output channel; this allows training with datasets containing any subset of the supported channels. Right: The X$\rightarrow$RGB model concatenates the input intrinsic channels, again encoded by the pre-trained encoder, with the diffusion latent. One exception is the irradiance (lighting) channel, which is downsampled to latent resolution rather than passed through the encoder. This model can accept usual text prompts. All input conditions to X$\rightarrow$RGB are optional.
  • Figure 2: Synthetic data comparison of our RGB$\rightarrow$X model against previous methods (InteriorVerse, OrdinalShading) and a known ground truth. All input images and ground truths are from Hypersim, except for the classroom scene (c).
  • Figure 3: Real-data comparison of our RGB$\rightarrow$X model to previous methods.
  • Figure 4: Our X$\rightarrow$RGB result on the synthetic kitchen scene, which is not part of our training data. We rendered all intrinsic channels, shown on the left, and fed them into the model, along with a text prompt. The result matches the path-traced reference well. There are some differences, e.g., X$\rightarrow$RGB makes the stove brighter than the requested albedo, likely because dark metallic materials are rare in the training data.
  • Figure 5: X$\rightarrow$RGB synthesis given normal and albedo channels only, demonstrating lighting and color-control use of text prompts. (a) Starting from normal and albedo only, we show that the lighting can be controlled by text prompts to some extent. (b) Starting from normal and albedo only, we similarly show the color of objects can be controlled by text prompts to some extent.
  • ...and 5 more figures
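
The X$\rightarrow$RGB input assembly described in the Figure 1 caption can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: `toy_encode` is a hypothetical stand-in for the pre-trained encoder (average pooling plus a fixed projection), and the downsampling factor of 8 and the 4 latent channels are assumed values. Only the overall pattern comes from the caption: encoded intrinsic channels are concatenated with the diffusion latent, while irradiance is downsampled to latent resolution instead of being encoded.

```python
import numpy as np

LATENT_FACTOR = 8    # assumed image-to-latent downsampling factor
LATENT_CHANNELS = 4  # assumed number of latent channels


def downsample(image, factor=LATENT_FACTOR):
    """Average-pool an (H, W, C) image by `factor` along height and width."""
    h, w, c = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def toy_encode(image):
    """Hypothetical stand-in for the pre-trained encoder.

    Pools to latent resolution, then applies a fixed linear projection
    to LATENT_CHANNELS channels. A real encoder is a learned network.
    """
    pooled = downsample(image)
    c = pooled.shape[-1]
    projection = np.full((c, LATENT_CHANNELS), 1.0 / c, dtype=pooled.dtype)
    return pooled @ projection


def assemble_input(noisy_latent, channels, irradiance):
    """Concatenate the diffusion latent with the conditioning inputs."""
    parts = [noisy_latent]
    for name in sorted(channels):              # fixed order for repeatability
        parts.append(toy_encode(channels[name]))
    parts.append(downsample(irradiance))       # lighting skips the encoder
    return np.concatenate(parts, axis=-1)


rng = np.random.default_rng(0)
noisy_latent = rng.standard_normal((8, 8, LATENT_CHANNELS)).astype(np.float32)
channels = {
    "normal": rng.random((64, 64, 3), dtype=np.float32),
    "albedo": rng.random((64, 64, 3), dtype=np.float32),
}
irradiance = rng.random((64, 64, 3), dtype=np.float32)
model_input = assemble_input(noisy_latent, channels, irradiance)
print(model_input.shape)  # (8, 8, 15): 4 latent + 4 + 4 encoded + 3 irradiance
```

A real model needs a fixed input width, so omitted conditions would have to be filled with a placeholder rather than skipped as they are here; the sketch only shows the concatenation pattern.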