Controllable 3D Outdoor Scene Generation via Scene Graphs
Yuheng Liu, Xinke Li, Yuning Zhang, Lu Qi, Xin Li, Wenping Wang, Chongshou Li, Xueting Li, Ming-Hsuan Yang
TL;DR
The paper tackles controllable 3D outdoor scene generation conditioned on scene graphs. It introduces a three-part pipeline: a formal scene-graph formulation; a scene-graph-guided diffusion framework in which a Graph Attention Network produces Context-Aware Node Embeddings and an Allocation Module turns them into a BEV Embedding Map; and a two-stage diffusion process (2D, then 3D) trained with auxiliary losses $L_a$ and $L_\theta$. An interactive system for graph creation accompanies the pipeline. A new CarlaSG dataset pairs 3D scenes with scene graphs derived from CarlaSC, enabling training and evaluation of the approach. Results show high fidelity to scene-graph specifications, precise control over object counts and road type, and diverse yet consistent outputs, outperforming SG2Im and LLM baselines in control capacity and alignment. This work offers a scalable, user-friendly pathway to realistic, graph-guided outdoor 3D synthesis, with potential applications in autonomous driving, gaming, and simulation.
Abstract
Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming, and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs, an accessible and user-friendly control format, to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.
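To make the idea of a sparse scene graph and a dense BEV map concrete, the following is a minimal sketch in Python and not the paper's implementation. The class `SceneGraph`, the function `to_bev_map`, the relation label `"on"`, and the category names are all hypothetical. In particular, the sketch assumes node positions and fixed per-category embeddings are supplied by hand, whereas the paper describes learned, context-aware embeddings and a learned allocation step.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Sparse scene description: labelled nodes and labelled edges (hypothetical)."""
    nodes: dict = field(default_factory=dict)  # node id -> category name
    edges: list = field(default_factory=list)  # (source id, relation, target id)

    def add_node(self, node_id, category):
        self.nodes[node_id] = category

    def add_edge(self, src, relation, dst):
        # Reject edges that point at nodes the graph does not contain.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.append((src, relation, dst))

    def counts(self):
        """Number of nodes per category, e.g. how many cars were requested."""
        return Counter(self.nodes.values())


def to_bev_map(graph, positions, embeddings, size):
    """Write one embedding vector per placed node into a size x size grid.

    ASSUMPTION: `positions` (node id -> (row, col)) and `embeddings`
    (category -> vector) are given by hand. They stand in for the learned
    allocation and node-embedding steps, which this sketch does not model.
    """
    dim = len(next(iter(embeddings.values())))
    # Cells without a node keep an all-zero vector.
    grid = [[[0.0] * dim for _ in range(size)] for _ in range(size)]
    for node_id, (row, col) in positions.items():
        grid[row][col] = list(embeddings[graph.nodes[node_id]])
    return grid
```

A graph built with `add_node` and `add_edge` can be queried with `counts()` to check the requested number of objects per category, and `to_bev_map` then yields a nested list of shape `size x size x dim` that a conditional generator could consume as a spatial condition.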
