Generative World Renderer

Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang

Abstract

Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extract 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit the style of AAA games from G-buffers using text prompts.

Paper Structure

This paper contains 31 sections, 2 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: We present a large-scale dataset curated from game engines to support scalable generative world rendering. The dataset provides high-resolution RGB videos with aligned G-buffers, covering continuous and dynamic scenes, long temporal trajectories, and diverse visual conditions.
  • Figure 2: Motivation. Existing approaches, such as DiffusionRenderer, are trained primarily on synthetic datasets and therefore struggle to capture (a) complex reflection and illumination effects, (b) real scene elements (e.g., humans and cars), (c) fine-grained visual details, (d) dynamic motion, and (e) long-range temporal dependencies in long video sequences. By contrast, the proposed dataset provides high-fidelity, scene-level supervision with rich geometry, appearance, and temporal dynamics, enabling scalable generative video inverse rendering that generalizes more effectively to real-world scenarios.
  • Figure 3: Pipeline. Stage I: We curate video sequences containing RGB frames and five corresponding G-buffer channels from commercial game engines. Buffer interception is performed with ReShade, which lets us capture intermediate rendering outputs at runtime. Because a single rendering pass exposes thousands of heterogeneous and largely irrelevant buffers, an automated filtering procedure is required to identify valid candidates. To disambiguate target G-buffers from irrelevant render targets, we use RenderDoc for offline inspection to define filtering rules based on metadata invariants. By manually validating the semantic accuracy of the remaining buffers, we iteratively refine these rules into robust signatures that ensure consistent runtime identification. As a final verification step, we re-render the RGB frames from the collected G-buffers using a deferred shading pipeline and check for pixel-level consistency with the original RGB outputs (a minimal sketch of this filtering and verification step follows the figure list). Stage II: We annotate meta-information for each sequence and filter out unsatisfactory frames based on quality criteria. Stage III: We synthesize motion blur to enhance temporal realism and better match real-world capture conditions.
  • Figure 4: Qualitative comparison of inverse rendering on in-the-wild data. (Top to bottom: albedo, normal, depth, metallic, roughness). Our method significantly outperforms DiffusionRenderer in inverse rendering. It produces cleaner albedo with better delighting, artifact-free geometry, and robust material predictions that effectively resist complex outdoor illumination and atmospheric disruptions like smoke.
  • Figure 5: Qualitative comparison of inverse rendering on in-the-wild data.
  • ...and 4 more figures
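
The sketch below illustrates the Stage I filtering and verification idea from the Figure 3 caption: intercepted render targets are matched against metadata signatures (resolution, texel format), and an RGB frame re-composited from the collected G-buffers is compared against the game's own output at the pixel level. This is a minimal, assumed reconstruction for illustration only; the BufferMeta fields, the signature rules, and the tolerance value are hypothetical and not taken from the paper or from ReShade/RenderDoc APIs.

```python
import numpy as np
from dataclasses import dataclass

# Hypothetical metadata record for one intercepted render target.
# Real ReShade/RenderDoc inspection exposes richer metadata than this.
@dataclass
class BufferMeta:
    name: str
    width: int
    height: int
    fmt: str  # texel format, e.g. "RGBA16F" (illustrative)

# A "signature" is a set of metadata invariants that a target G-buffer
# must satisfy in every frame; the rules below are placeholder examples
# refined in the paper's pipeline by manual validation.
SIGNATURES = {
    "normal": lambda m, w, h: (m.width, m.height) == (w, h) and m.fmt == "RGBA16F",
    "depth":  lambda m, w, h: (m.width, m.height) == (w, h) and m.fmt.startswith("D32"),
    # ... analogous rules for albedo, metallic, and roughness
}

def identify_gbuffers(candidates, frame_w, frame_h):
    """Return the unique candidate matching each signature, or None if the
    match is ambiguous (zero or multiple candidates)."""
    found = {}
    for key, rule in SIGNATURES.items():
        matches = [m for m in candidates if rule(m, frame_w, frame_h)]
        found[key] = matches[0] if len(matches) == 1 else None
    return found

def consistency_check(rerendered, rgb, tol=2.0 / 255.0):
    """Pixel-level verification: the frame re-composited from the collected
    G-buffers (via deferred shading, not shown) should agree with the game's
    own RGB output up to a small tolerance. Inputs are float arrays in [0, 1]."""
    return float(np.abs(rerendered - rgb).mean()) < tol
```

In this reading of the pipeline, a capture is kept only when every target G-buffer is identified unambiguously and the re-rendered frame passes the consistency check; the iterative rule refinement described in the caption would amount to tightening SIGNATURES until both conditions hold across all validated sequences.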