X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability
Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee
TL;DR
X-Scene targets large-scale, spatially coherent driving scene generation by combining three elements: multi-granular controllability, a joint occupancy–image–video diffusion pipeline, and consistency-aware outpainting. It introduces text-to-layout enrichment via LLMs and scene-graph diffusion, a triplane-based occupancy generator with geometry-guided image synthesis, and motion-aware video diffusion; outpainting then extends locally generated regions into larger scenes. The approach yields high-fidelity geometry and photorealistic appearance across large environments, with strong cross-modal alignment and favorable results on downstream tasks (e.g., occupancy prediction, BEV perception, and end-to-end planning). The work reports substantial improvements over prior methods in both controllability and fidelity, supports realistic data generation and driving simulation, and outlines limitations and future directions, including longer-horizon dynamics and broader data sources.
Abstract
Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, large-scale 3D scene generation requiring spatial coherence remains underexplored. In this paper, we present X-Scene, a novel framework for large-scale driving scene generation that achieves geometric intricacy, appearance fidelity, and flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level layout conditioning driven by user input or text for detailed scene composition, and high-level semantic guidance informed by user intent and LLM-enriched prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and corresponding multi-view images and videos, ensuring alignment and temporal consistency across modalities. We further extend local regions into large-scale scenes via consistency-aware outpainting, which extrapolates occupancy and images from previously generated areas to maintain spatial and visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as simulation and scene exploration. Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, empowering data generation and simulation for autonomous driving.
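The outpainting idea in the abstract (extrapolating new regions from previously generated areas) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a 2D grid of semantic labels in place of 3D occupancy, and `toy_fill` is a hypothetical stand-in for the diffusion model. The names `outpaint_right`, `overlap`, and `new_cols` are likewise illustrative. The only point it makes is that known cells in the overlap act as fixed context while the model fills only the unknown cells.

```python
from typing import Callable, List, Optional

Grid = List[List[Optional[int]]]  # rows x cols; None marks a cell not yet generated


def toy_fill(window: Grid) -> Grid:
    """Hypothetical stand-in for the generative model: fills each unknown
    cell with the nearest known value to its left in the same row."""
    out = []
    for row in window:
        filled, last = [], 0
        for cell in row:
            if cell is not None:
                last = cell
            filled.append(last)
        out.append(filled)
    return out


def outpaint_right(scene: Grid, new_cols: int, overlap: int,
                   fill: Callable[[Grid], Grid] = toy_fill) -> Grid:
    """Extend `scene` to the right by `new_cols` columns.

    The last `overlap` columns of the existing scene are passed in as fixed
    context; only the unknown cells are taken from the model output.
    """
    if not 0 < overlap <= len(scene[0]):
        raise ValueError("overlap must be between 1 and the scene width")
    window = [row[-overlap:] + [None] * new_cols for row in scene]
    generated = fill(window)
    extended = []
    for row, win_row, gen_row in zip(scene, window, generated):
        # Keep known cells untouched; accept model output only where unknown.
        new_part = [g for w, g in zip(win_row, gen_row) if w is None]
        extended.append(row + new_part)
    return extended


seed = [[1, 1, 2],
        [0, 3, 3]]
large = outpaint_right(seed, new_cols=2, overlap=2)
large = outpaint_right(large, new_cols=2, overlap=2)
```

Each call leaves the already generated region unchanged and grows the scene by `new_cols` columns, so repeated calls scale a local region into a larger one without rewriting earlier content.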
