Controllable 3D Outdoor Scene Generation via Scene Graphs

Yuheng Liu, Xinke Li, Yuning Zhang, Lu Qi, Xin Li, Wenping Wang, Chongshou Li, Xueting Li, Ming-Hsuan Yang

TL;DR

The paper tackles controllable 3D outdoor scene generation conditioned on scene graphs. It introduces a three-part pipeline: a formal scene-graph formulation; a scene-graph-guided diffusion framework in which a Graph Attention Network produces Context-Aware Node Embeddings and an Allocation Module arranges them into a BEV Embedding Map; and a two-stage diffusion process (2D, then 3D) trained with auxiliary losses $L_a$ and $L_\theta$. The pipeline is paired with an interactive system for graph creation. A new CarlaSG dataset pairs 3D scenes with scene graphs derived from CarlaSC, enabling training and evaluation of the approach. Results show high fidelity to scene-graph specifications, precise control over object counts and road types, and diverse yet consistent outputs, outperforming SG2Im and LLM baselines in control capacity and alignment. This work offers a scalable, user-friendly pathway to realistic, graph-guided outdoor 3D synthesis with broad potential in autonomous driving, gaming, and simulation.

Abstract

Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming, and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs, an accessible, user-friendly control format, to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.
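To make the "sparse scene graph to dense BEV Embedding Map" step concrete, the following is a minimal toy sketch, not the paper's implementation. Everything in it is an assumption for illustration: the `SceneGraph` class, the label vocabulary, the neighbour-averaging in `context_embeddings` (a crude stand-in for a learned graph network), and the hand-picked grid positions (the paper instead uses a learned Allocation Module).

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    nodes: list                                # node labels, e.g. "car" (hypothetical)
    edges: list = field(default_factory=list)  # pairs of node indices


def label_embedding(label, vocab):
    """One-hot vector over a small, hypothetical label vocabulary."""
    return [1.0 if label == v else 0.0 for v in vocab]


def context_embeddings(graph, vocab):
    """Average each node's one-hot vector with those of its neighbours.

    This is only a stand-in for a learned graph network.
    """
    base = [label_embedding(n, vocab) for n in graph.nodes]
    neighbours = {i: [i] for i in range(len(graph.nodes))}
    for a, b in graph.edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    out = []
    for i in range(len(graph.nodes)):
        group = [base[j] for j in neighbours[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def build_bev_map(embeddings, positions, height, width):
    """Write each node embedding into its (row, col) cell of a dense grid.

    Cells without a node stay as zero vectors.
    """
    dim = len(embeddings[0])
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for emb, (r, c) in zip(embeddings, positions):
        grid[r][c] = list(emb)
    return grid


vocab = ["road", "car", "building"]
graph = SceneGraph(nodes=["road", "car", "car"], edges=[(0, 1), (0, 2)])
emb = context_embeddings(graph, vocab)
# Positions are chosen by hand here purely for illustration.
bev = build_bev_map(emb, positions=[(2, 2), (1, 2), (3, 2)], height=5, width=5)
```

In this sketch the resulting `bev` grid plays the role of the dense conditioning input: a diffusion model would receive it alongside the noisy scene, while the graph itself stays small and easy for a user to edit.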

Paper Structure

This paper contains 13 sections, 10 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Scene Graph Guided 3D Outdoor Scene Generation. Compared to text descriptions and BEV maps, scene graphs offer a more intuitive and user-friendly format for controlling 3D scene generation. We also develop an interactive system that allows users to generate/edit dense 3D scenes through scene graph interaction.
  • Figure 2: Overview of Scene Graph Guided 3D Scene Generation. The Scene Graph Guided 3D Generation structure consists of three main components: the interactive system (red), BEV Embedding Map (BEM) processing (blue), and diffusion generation (bottom). Through the interactive system, users can construct their own scene graphs using either an interactive interface or text interaction. The constructed scene graph is processed by a GNN, which is jointly trained with the diffusion model using auxiliary tasks to enhance control. Each node in the scene graph is then positioned by the Allocation Module to form the BEM. This BEM serves as a conditioning input to the 3D Pyramid Discrete Diffusion Model (PDD), which generates the final 3D outdoor scene. Note that "Recon", "Classification", and "CANE" denote "Edge Reconstruction", "Node Classification", and "Context-aware Node Embedding", respectively.
  • Figure 3: Scene Graph Generation. LLMs convert the user’s prompt into a scene graph, which guides 3D scene generation.
  • Figure 4: Controlling 3D Outdoor Scene Generation with Scene Graphs. We compare our method with baseline methods. Results show that our method generates scenes consistent with the provided scene graph, whereas the SG2Im and LLM approaches exhibit inconsistencies in object quantities and road types.
  • Figure 5: Diversity in Scene Generation. Comparison of three scenes generated by our method under the same scene graph. This demonstrates our method’s ability to produce varied yet consistent scenes based on identical input.
  • ...and 3 more figures