Diffusion-based G-buffer generation and rendering

Bowen Xue, Giuseppe Claudio Guarnera, Shuang Zhao, Zahra Montazeri

TL;DR

This work tackles the limited editability of text-to-image diffusion systems by introducing a diffusion-based two-stage pipeline that first generates a G-buffer from a text prompt and then renders a final image with a modular neural renderer. A partially frozen diffusion backbone paired with a ControlNet produces geometry, material, and lighting channels (albedo, normals, depth, roughness, metallic, irradiance) which can be edited or augmented via channel-level operations, object insertion, or masking for lighting changes. The second-stage renderer employs geometry, material, and lighting sub-networks that follow a physically based rendering decomposition, improving realism for reflections, shadows, and transparency, and enabling post-generation edits without re-running full diffusion. The approach demonstrates enhanced editability and generalization from indoor to outdoor scenes, while preserving the broad capabilities of large pre-trained models, and it uses mask-guided fine-tuning and a branching architecture to maintain stability during training and rendering.

Abstract

Despite recent advances in text-to-image generation, controlling geometric layout and material properties in synthesized scenes remains challenging. We present a novel pipeline that first produces a G-buffer (albedo, normals, depth, roughness, and metallic) from a text prompt and then renders a final image through a modular neural network. This intermediate representation enables fine-grained editing: users can copy and paste within specific G-buffer channels to insert or reposition objects, or apply masks to the irradiance channel to adjust lighting locally. As a result, real objects can be seamlessly integrated into virtual scenes, and virtual objects can be placed into real environments with high fidelity. By separating scene decomposition from image rendering, our method offers a practical balance between detailed post-generation control and efficient text-driven synthesis. We demonstrate its effectiveness on a variety of examples, showing that G-buffer editing significantly extends the flexibility of text-guided image generation.
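
The channel-level edits described in the abstract (pasting an object into selected G-buffer channels, then masking the irradiance channel to change lighting locally) can be sketched with plain array operations. This is a minimal illustration, not the authors' implementation: the helper names `make_gbuffer`, `paste_region`, and `scale_irradiance`, the per-channel depths, and the constant fill values are all assumptions made for the example.

```python
import numpy as np

# Channel names follow the G-buffer listed in the paper; the number of
# components per channel is an assumption made for this sketch.
CHANNELS = {"albedo": 3, "normal": 3, "depth": 1,
            "roughness": 1, "metallic": 1, "irradiance": 3}


def make_gbuffer(height, width, fill):
    """Build a toy G-buffer whose channels all hold one constant value."""
    return {name: np.full((height, width, c), fill, dtype=np.float32)
            for name, c in CHANNELS.items()}


def paste_region(dst, src, mask, channels):
    """Copy masked pixels from src into dst, for the chosen channels only."""
    out = {name: arr.copy() for name, arr in dst.items()}
    m = mask[..., None]  # (H, W, 1) so the mask broadcasts over components
    for name in channels:
        out[name] = np.where(m, src[name], dst[name])
    return out


def scale_irradiance(gbuffer, mask, gain):
    """Scale the irradiance channel inside the mask; leave the rest as is."""
    out = {name: arr.copy() for name, arr in gbuffer.items()}
    m = mask[..., None]
    out["irradiance"] = np.where(m, gbuffer["irradiance"] * gain,
                                 gbuffer["irradiance"])
    return out


# Toy usage: insert a 2x2 "object" into a 4x4 scene, then dim its lighting.
scene = make_gbuffer(4, 4, fill=0.5)
obj = make_gbuffer(4, 4, fill=0.9)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

edited = paste_region(scene, obj, mask,
                      ["albedo", "normal", "depth", "roughness", "metallic"])
relit = scale_irradiance(edited, mask, gain=0.5)
```

In the pipeline described above, an edited G-buffer such as `relit` would then be passed to the second-stage renderer, so the edit does not require re-running the text-to-G-buffer stage.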

Paper Structure

This paper contains 37 sections, 11 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview. Our pipeline begins with a random noise sample and a text prompt. These inputs are processed by the stage-1 network, which consists of two denoising steps: first, a frozen Stable Diffusion 2 model (in gray), followed by a fine-tuned Stable Diffusion 2 model augmented with ControlNet. Stage 1 produces a G-buffer comprising albedo, normal, depth, irradiance, roughness, and metallic. These channels are then grouped and passed to the stage-2 network, where an optional mask is used for object movement or insertion. Each group is processed by specialized sub-networks, fused by a final grouping module, and then fed into another ControlNet-equipped, fine-tuned Stable Diffusion 2 model to generate the final RGB output.
  • Figure 2: Text-to-G-buffer Ablation. This figure compares the performance of three text-to-G-buffer generation approaches across three example scenes (rows), with all images depicting normal maps. The first column shows results from linking the RGBX network to the full Stable Diffusion pipeline, using the same noise, seed, and generator as our method. The second column presents outcomes from directly training the Stable Diffusion UNet without ControlNet. The third column showcases results from our full method, demonstrating its superior performance compared to the alternatives.
  • Figure 3: Ablation of G-buffer to Final Image with or without Branch Networks. This figure illustrates the impact of Branch Networks on G-buffer rendering. Results show that including Branch Networks produces outputs more closely aligned with the ground truth. All G-buffers and ground-truth images are from the Hypersim dataset.
  • Figure 4: Comparison with Diffusion Handles. In this figure, we compare object movement results between our method and Diffusion Handles. Our approach consistently achieves higher-quality outputs with minimal background alterations, whereas Diffusion Handles exhibits more pronounced background changes and underperforms under extreme lighting conditions.
  • Figure 5: User study results (156 participants). Each bar shows the percentage of participants who preferred either our method or the baseline across the evaluation criteria. Participants compared pairs of static images generated by our method and the respective baseline (RGBX or Diffusion Handles). Three evaluation questions were used for comparisons with RGBX, while two questions were used for comparisons with Diffusion Handles due to technical limitations in rendering buffers with Diffusion Handles.
  • ...and 3 more figures
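
The channel grouping described in the Figure 1 caption, where G-buffer channels are split into groups before each group is handled by its own sub-network, can be sketched as a simple lookup. The assignment of channels to the geometry, material, and lighting groups below is an assumption inferred from the group names in the TL;DR; the source does not spell it out, and `GROUPS` and `group_gbuffer` are hypothetical names.

```python
# Assumed channel-to-group assignment (not confirmed by the source).
GROUPS = {
    "geometry": ("normal", "depth"),
    "material": ("albedo", "roughness", "metallic"),
    "lighting": ("irradiance",),
}


def group_gbuffer(gbuffer):
    """Split a G-buffer dict into per-group dicts, checking completeness."""
    expected = {name for names in GROUPS.values() for name in names}
    missing = expected - set(gbuffer)
    if missing:
        raise KeyError(f"G-buffer is missing channels: {sorted(missing)}")
    return {group: {name: gbuffer[name] for name in names}
            for group, names in GROUPS.items()}


# Toy usage with placeholder values standing in for image arrays.
toy = {name: f"<{name} map>" for names in GROUPS.values() for name in names}
grouped = group_gbuffer(toy)
```

Each entry of `grouped` would then be the input to one specialized sub-network, with the outputs fused before the final rendering step.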