FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch

TL;DR

FrameDiffuser is an autoregressive, G-buffer-conditioned diffusion model for frame-by-frame neural rendering in interactive applications. It addresses two limitations of prior methods: temporal inconsistency and the need for complete sequences upfront. Its dual-conditioning architecture combines ControlNet for geometry and irradiance guidance with ControlLoRA for temporal coherence, and a three-stage training process mitigates drift. Environment-specific specialization yields more photorealistic lighting, shadows, and reflections than generalized approaches, while maintaining practical inference speeds. The work advocates integrating neural augmentation with traditional rendering to enable consistent, artistically controllable visuals in interactive environments.

Abstract

Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
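
The autoregressive scheme described in the abstract can be illustrated with a minimal sketch. It shows only the control flow: each frame is produced from the current G-buffer and the previously generated frame, so no future frames are required. The names `render_autoregressively`, `generate_frame`, and `toy_model` are hypothetical stand-ins introduced here for illustration; they are not the authors' implementation, and the real model is a conditioned diffusion network rather than a blend.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")
GBuffer = TypeVar("GBuffer")


def render_autoregressively(
    initial_frame: Frame,
    gbuffers: Iterable[GBuffer],
    generate_frame: Callable[[GBuffer, Frame], Frame],
) -> Iterator[Frame]:
    """Yield one generated frame per incoming G-buffer.

    `generate_frame` is a hypothetical stand-in for the conditioned
    diffusion model: it takes the current G-buffer and the previously
    generated frame, and returns the next frame.
    """
    previous = initial_frame
    for gbuffer in gbuffers:
        current = generate_frame(gbuffer, previous)
        yield current
        # Feed the model's own output back in as temporal guidance.
        previous = current


# Toy usage: "frames" are plain floats, and the stand-in model simply
# blends the G-buffer value with the previous output.
def toy_model(gbuffer: float, previous: float) -> float:
    return 0.5 * gbuffer + 0.5 * previous


frames = list(render_autoregressively(0.0, [1.0, 1.0, 1.0], toy_model))
print(frames)  # [0.5, 0.75, 0.875]
```

Because the loop is a generator, frames can be consumed as G-buffers arrive, which mirrors the interactive setting where later G-buffers depend on user input.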

Paper Structure

This paper contains 24 sections, 4 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: G-buffer to Photorealistic Rendering. FrameDiffuser transforms geometric and material data from G-buffer into photorealistic rendered images with realistic global illumination (GI), shadows, and reflections. Our autoregressive approach maintains temporal consistency for long sequences, enabling neural rendering for interactive applications. Project page: https://framediffuser.jdihlmann.com/
  • Figure 2: FrameDiffuser Architecture with dual conditioning. ControlNet processes a 10-channel input comprising 9 G-buffer channels for structural guidance and 1 predicted irradiance channel for lighting guidance, computed from the previous frame's model output and basecolor. ControlLoRA conditions on the previous frame encoded in VAE latent space for temporal coherence. The generated output at time $t$ is used to compute the irradiance input for the next frame at time $t+1$, enabling autoregressive frame generation. The encoder $\mathcal{E}$ and decoder $\mathcal{D}$ represent the VAE components operating in latent space. The training strategy on the right shows our three-stage approach: first, we train ControlNet on the G-buffer-to-image translation task without irradiance. Second, we add ControlLoRA and irradiance for temporal conditioning. Third, we train autoregressively using the model's own generated frames as previous-frame inputs to make the model robust against its own generation errors.
  • Figure 3: Qualitative Results showing FrameDiffuser's autoregressive generation across multiple frames, including long-term stability at frame 4000. From top to bottom: ground truth, our autoregressive output, and the G-buffer channels (basecolor, normal, depth, roughness, metallic) alongside the computed irradiance. The model maintains temporal consistency and accurate material properties across extended sequences.
  • Figure 4: Qualitative Comparison with X→RGB across the Downtown West (urban) and Hillside Sample (indoor) environments. Our method achieves high-detail lighting and maintains temporal consistency over long sequences, whereas X→RGB applies more uniform lighting.
  • Figure 5: Temporal Stability Analysis over 3000+ consecutive validation frames of pure autoregressive generation inside the Hillside Sample Project environment (Figure 4). We compare our model output against pure VAE reconstruction to measure encoder degradation. Metrics degrade between frames 800--1700, when the camera enters very dark rooms; the model's bias towards lit areas causes it to insufficiently capture extreme darkness. After frame 3000, all metrics degrade when camera movement slows and G-buffer changes become minimal, causing error accumulation.
  • ...and 6 more figures
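
The conditioning input described in the Figure 2 caption can be sketched per pixel. The caption states only that ControlNet receives 9 G-buffer channels plus 1 irradiance channel computed from the previous output and basecolor. Everything else below is an assumption made for illustration: the channel split (basecolor 3, normal 3, depth 1, roughness 1, metallic 1, inferred from the channels listed for Figure 3), the luminance-ratio irradiance estimate, the `EPS` guard, and all function names. The paper's actual formula may differ.

```python
from typing import List, Sequence

EPS = 1e-4  # assumed guard against division by zero on black basecolor


def luminance(rgb: Sequence[float]) -> float:
    """Weighted sum of an RGB triple using Rec. 709 luma coefficients."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def estimate_irradiance(prev_rgb: Sequence[float],
                        basecolor: Sequence[float]) -> float:
    """Assumed estimate: previous-frame luminance divided by basecolor luminance."""
    return luminance(prev_rgb) / (luminance(basecolor) + EPS)


def controlnet_pixel(basecolor: Sequence[float], normal: Sequence[float],
                     depth: float, roughness: float, metallic: float,
                     prev_rgb: Sequence[float]) -> List[float]:
    """Stack 9 assumed G-buffer channels and 1 irradiance channel."""
    gbuffer = [*basecolor, *normal, depth, roughness, metallic]  # 9 channels
    return gbuffer + [estimate_irradiance(prev_rgb, basecolor)]  # 10 channels


pixel = controlnet_pixel(
    basecolor=(0.5, 0.5, 0.5), normal=(0.0, 0.0, 1.0),
    depth=0.3, roughness=0.8, metallic=0.0,
    prev_rgb=(0.25, 0.25, 0.25),  # previous model output at this pixel
)
print(len(pixel))  # 10
```

In the autoregressive setting, `prev_rgb` would come from the frame generated at time $t$ and the result would condition generation at time $t+1$.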