X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability

Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee

TL;DR

X-Scene addresses the challenge of generating large-scale driving scenes with spatial coherence by unifying multi-granular controllability, a joint occupancy–image–video diffusion pipeline, and consistent extrapolation. It introduces text-to-layout enrichment via LLMs and scene-graph diffusion, a triplane-based occupancy generator with geometry-guided image synthesis, and motion-aware video diffusion, all integrated with consistency-aware outpainting to scale scenes. The approach yields high-fidelity geometry and photorealistic appearances across large environments, with strong cross-modal alignment and favorable downstream task performance (e.g., occupancy prediction, BEV, and end-to-end planning). The work demonstrates substantial improvements over prior methods in both controllability and fidelity, enabling realistic data generation and driving-simulation applications while outlining limitations and directions for longer-horizon dynamics and broader data sources.

Abstract

Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, large-scale 3D scene generation requiring spatial coherence remains underexplored. In this paper, we present X-Scene, a novel framework for large-scale driving scene generation that achieves geometric intricacy, appearance fidelity, and flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level layout conditioning driven by user input or text for detailed scene composition, and high-level semantic guidance informed by user intent and LLM-enriched prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and corresponding multi-view images and videos, ensuring alignment and temporal consistency across modalities. We further extend local regions into large-scale scenes via consistency-aware outpainting, which extrapolates occupancy and images from previously generated areas to maintain spatial and visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as simulation and scene exploration. Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, empowering data generation and simulation for autonomous driving.

Paper Structure

This paper contains 58 sections, 3 equations, 11 figures, 14 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of $\mathcal{X}$-Scene, a unified world generator that supports multi-granular controllability through high-level text-to-layout generation and low-level BEV layout conditioning. It performs joint occupancy, image, and video generation for 3D scene synthesis and reconstruction with high fidelity.
  • Figure 2: Pipeline of $\mathcal{X}$-Scene for driving scene generation: (a) Multi-granular controllability supports both high-level text prompts and low-level geometric constraints for flexible specification; (b) Joint occupancy-image-video generation synthesizes aligned 3D voxels and multi-view images and videos via conditional diffusion; (c) Large-scale extrapolation enables coherent scene expansion through consistency-aware outpainting (Fig. 4). Fig. 3 details the scene-graph to layout diffusion.
  • Figure 3: Pipeline of textual description enrichment and scene-graph to layout generation: (a) Input prompts are enriched using RAG-augmented LLMs to produce structured scene descriptions; (b) Spatial relationships are converted into a scene graph and encoded with a graph network, followed by conditional diffusion that denoises object boxes and lane polylines into the final layouts.
  • Figure 4: Illustration of (a) consistency-aware outpainting: (b) Occupancy triplane extrapolation is decomposed into three 2D plane extensions guided by overlapped regions; (c) Image extrapolation is performed via diffusion conditioned on images and camera parameters.
  • Figure 5: Versatile generation capability of $\mathcal{X}$-Scene: (a) Generation of large-scale, consistent semantic occupancy and multi-view images, which are reconstructed into 3D scenes for multi-view rendering; (b) User-prompted layout and scene generation, along with scene geometry editing.
  • ...and 6 more figures
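
The Figure 4 caption describes occupancy extrapolation as three 2D plane extensions, each guided by the region it shares with previously generated content. The sketch below illustrates only that overlap-guided idea on a single 2D plane, and it is not the authors' implementation. The names `overlap_mask`, `extend_plane`, and `fill_fn` are hypothetical, and `fill_fn` is a placeholder for whatever generative model (a diffusion denoiser in the paper) fills the unseen cells.

```python
import numpy as np

def overlap_mask(plane_size, shift):
    """Mark cells of a shifted window that the previous window already covers.

    plane_size: (H, W) of both windows, in cells.
    shift: (dy, dx) offset of the new window relative to the previous one.
    """
    h, w = plane_size
    dy, dx = shift
    mask = np.zeros((h, w), dtype=bool)
    # New-window cell (i, j) sits at (i + dy, j + dx) in the old window's frame.
    y0, y1 = max(0, -dy), min(h, h - dy)
    x0, x1 = max(0, -dx), min(w, w - dx)
    if y0 < y1 and x0 < x1:
        mask[y0:y1, x0:x1] = True
    return mask

def extend_plane(prev_plane, shift, fill_fn):
    """Extend one 2D plane (H, W): copy the overlap, generate the rest.

    fill_fn(partial_plane, known_mask) is a hypothetical stand-in for the
    generative model; its output is used only where known_mask is False.
    """
    h, w = prev_plane.shape
    dy, dx = shift
    mask = overlap_mask((h, w), shift)
    partial = np.zeros_like(prev_plane)
    ys, xs = np.nonzero(mask)
    partial[ys, xs] = prev_plane[ys + dy, xs + dx]  # reuse known content
    generated = fill_fn(partial, mask)
    # Keep previously generated content fixed inside the overlap.
    return np.where(mask, partial, generated)

# Usage: slide the window two cells to the right and fill new cells with -1.
prev = np.arange(16, dtype=float).reshape(4, 4)
new = extend_plane(prev, (0, 2), lambda p, m: np.full_like(p, -1.0))
```

In this toy setup the left half of `new` repeats the right half of `prev`, so the two windows agree wherever they overlap. A triplane version would apply the same step to each of the three planes with the corresponding 2D shift, under the same assumptions.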