UniScene: Unified Occupancy-centric Driving Scene Generation

Bohan Li; Jiazhe Guo; Hongsi Liu; Yingshuang Zou; Yikang Ding; Xiwu Chen; Hu Zhu; Feiyang Tan; Chi Zhang; Tiancai Wang; Shuchang Zhou; Li Zhang; Xiaojuan Qi; Hao Zhao; Mu Yang; Wenjun Zeng; Xin Jin

TL;DR

UniScene introduces a two-stage, occupancy-centric framework that first generates semantic occupancy from BEV layouts and then generates video and LiDAR data conditioned on that occupancy. Two transfer strategies bridge occupancy to the target modalities: Gaussian-based Joint Rendering for 2D video and Prior-guided Sparse Modeling for 3D LiDAR. The Occupancy Diffusion Transformer and Temporal-aware Occupancy VAE enable controllable, temporally coherent occupancy generation, while the downstream video and LiDAR generators use occupancy priors for high-fidelity outputs. Experiments on nuScenes show SOTA performance across occupancy, video, and LiDAR generation and improved downstream perception tasks, highlighting UniScene's potential for unified, multi-modal data synthesis in autonomous driving.

Abstract

Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output the rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms in driving scenes: semantic occupancy, video, and LiDAR. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies: Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in occupancy, video, and LiDAR generation, and that the generated data benefits downstream driving tasks. Project page: https://arlo0o.github.io/uniscene/
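
The two-step decomposition can be written compactly as a factorization. This is a reading aid in notation the source does not use: B denotes the scene layout, O the semantic occupancy, V the video, and L the LiDAR data. It assumes, as a simplification, that video and LiDAR depend on the layout only through the occupancy and are generated independently of each other given O.

```latex
% Assumed notation (not from the source): B = layout, O = occupancy, V = video, L = LiDAR
\[
p(O, V, L \mid B)
  \;=\; \underbrace{p(O \mid B)}_{\text{step (a)}}
  \;\underbrace{p(V \mid O)\, p(L \mid O)}_{\text{step (b)}}
\]
```

Under this reading, a single-modality method models p(V | B) or p(L | B) directly, whereas the occupancy-centric approach samples O first and then conditions each modality on it.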
