UniScene: Unified Occupancy-centric Driving Scene Generation

Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, Xin Jin

TL;DR

UniScene introduces a two-stage, occupancy-centric framework: it first generates semantic occupancy from BEV layouts, then generates video and LiDAR data conditioned on that occupancy. Two transfer strategies bridge the representations: Gaussian-based Joint Rendering links occupancy to 2D video, and Prior-guided Sparse Modeling links it to 3D LiDAR. The Occupancy Diffusion Transformer and Temporal-aware Occupancy VAE enable controllable, temporally coherent occupancy generation, while the downstream video and LiDAR generators use occupancy priors to produce high-fidelity outputs. Experiments on nuScenes show state-of-the-art performance across occupancy, video, and LiDAR generation, as well as improved downstream perception, highlighting UniScene's potential for unified multi-modal data synthesis in autonomous driving.

Abstract

Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output the rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms (semantic occupancy, video, and LiDAR) in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies: Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in occupancy, video, and LiDAR generation, and that the generated data benefits downstream driving tasks. Project page: https://arlo0o.github.io/uniscene/
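
The two hierarchical steps can be read as a simple data flow: occupancy is produced once from the layout, and both the video and LiDAR branches are conditioned on it. The sketch below only illustrates that flow. Every function name, the sparse-dictionary occupancy format, and the grid sizes are hypothetical, and each function is a trivial placeholder standing in for a learned generator, not the paper's implementation.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]            # integer (x, y, z) grid index
Occupancy = Dict[Voxel, int]            # sparse grid: occupied voxel -> semantic label


def generate_occupancy(bev_layout: List[List[int]], height: int = 4) -> Occupancy:
    """Step (a) placeholder: extrude every non-empty BEV cell into a labelled column."""
    occ: Occupancy = {}
    for x, row in enumerate(bev_layout):
        for y, label in enumerate(row):
            if label != 0:               # 0 is assumed to mean "empty"
                for z in range(height):
                    occ[(x, y, z)] = label
    return occ


def generate_lidar(occ: Occupancy, voxel_size: float = 0.5) -> List[Tuple[float, ...]]:
    """Step (b) placeholder: emit one point at the centre of each occupied voxel."""
    return [tuple((i + 0.5) * voxel_size for i in voxel) for voxel in sorted(occ)]


def generate_video(occ: Occupancy, rows: int, cols: int) -> List[List[int]]:
    """Step (b) placeholder: a single top-down label image derived from occupancy."""
    frame = [[0] * cols for _ in range(rows)]
    for (x, y, _z), label in occ.items():
        frame[x][y] = label
    return frame


def generate_scene(bev_layout: List[List[int]]) -> dict:
    """Run step (a) once, then condition both step (b) branches on its output."""
    occ = generate_occupancy(bev_layout)
    return {
        "occupancy": occ,
        "video": generate_video(occ, len(bev_layout), len(bev_layout[0])),
        "lidar": generate_lidar(occ),
    }


# Toy layout: an 8 x 8 BEV grid with one 2 x 3 footprint of class 1.
bev = [[0] * 8 for _ in range(8)]
for x in range(2, 4):
    for y in range(3, 6):
        bev[x][y] = 1

scene = generate_scene(bev)
```

The point of the structure is that neither branch in step (b) sees the BEV layout directly. Both consume the occupancy, which is what makes it the shared intermediate representation the abstract describes.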

Paper Structure

This paper contains 27 sections, 19 equations, 22 figures, 12 tables.

Figures (22)

  • Figure 1: Overall framework of the proposed method. The joint generation process is organized into an occupancy-centric hierarchy: I. Controllable Occupancy Generation. The BEV layouts are concatenated with the noise volumes before being fed into the Occupancy Diffusion Transformer, and decoded with the Occupancy VAE Decoder $\mathcal{D}_\mathrm{occ}$. II. Occupancy-based Video and LiDAR Generation. The occupancy is converted into 3D Gaussians and rendered into semantic and depth maps, which are processed with additional encoders as in ControlNet. The output is obtained from the Video VAE Decoder $\mathcal{D}_\mathrm{vid}$. For LiDAR generation, the occupancy is processed via a sparse UNet and sampled with the geometric prior guidance, which is sent to the LiDAR head $\mathcal{D}_\mathrm{lid}$ for generation.
  • Figure 2: Visualization of the Gaussian-based joint rendering.
  • Figure 3: (a) Sparse sampling with occupancy-based prior guidance. (b) Visualization of the effect on LiDAR ray-dropping head.
  • Figure 4: Versatile generation ability of UniScene. (a) Large-scale coherent generation of semantic occupancy, LiDAR point clouds, and multi-view videos. (b) Controllable generation of geometry-edited occupancy, video, and LiDAR by simply editing the input BEV layouts to convey user commands. (c) Controllable generation of attribute-diverse videos by changing the input text prompts.
  • Figure 5: Qualitative evaluation for occupancy forecasting. Our method can compellingly handle sharp steering maneuvers and dynamic objects with temporal consistency.
  • ...and 17 more figures
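
The Figure 1 caption states that occupancy is converted into 3D Gaussians and rendered into semantic and depth maps. As a rough illustration of that idea, the sketch below turns each occupied voxel into one primitive and projects the primitives into two maps. It rests on several assumptions that do not come from the paper: the `Gaussian` fields, the half-voxel isotropic scale, and the orthographic nearest-point projection, which is a crude stand-in for real Gaussian splatting and ignores the scale entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Voxel = Tuple[int, int, int]


@dataclass(frozen=True)
class Gaussian:
    mean: Tuple[float, float, float]   # voxel centre in metres
    scale: float                       # isotropic extent (assumed: half a voxel)
    label: int                         # semantic class copied from the voxel


def occupancy_to_gaussians(occ: Dict[Voxel, int], voxel_size: float = 0.5) -> List[Gaussian]:
    """Create one primitive per occupied voxel, centred on that voxel."""
    return [
        Gaussian(tuple((i + 0.5) * voxel_size for i in voxel), voxel_size / 2, label)
        for voxel, label in sorted(occ.items())
    ]


def render_maps(
    gaussians: List[Gaussian], voxel_size: float, width: int, height: int
) -> Tuple[List[List[int]], List[List[Optional[float]]]]:
    """Simplified orthographic view along +x: each (z, y) pixel keeps its nearest primitive.

    Returns a semantic map (label per pixel, 0 = nothing hit) and a depth map
    (distance along x, None = nothing hit). The Gaussian scale is not used here.
    """
    semantic = [[0] * width for _ in range(height)]
    depth: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for g in gaussians:
        col = int(g.mean[1] / voxel_size)   # image column from y
        row = int(g.mean[2] / voxel_size)   # image row from z
        if 0 <= row < height and 0 <= col < width:
            if depth[row][col] is None or g.mean[0] < depth[row][col]:
                depth[row][col] = g.mean[0]
                semantic[row][col] = g.label
    return semantic, depth


# Toy occupancy: two voxels share a pixel (the nearer one wins), one sits elsewhere.
occ = {(0, 1, 0): 2, (3, 1, 0): 5, (1, 0, 1): 7}
gaussians = occupancy_to_gaussians(occ)
semantic_map, depth_map = render_maps(gaussians, voxel_size=0.5, width=2, height=2)
```

Producing the semantic and depth maps from the same set of primitives keeps them pixel-aligned, which is presumably the appeal of rendering them jointly before they are passed on as conditioning signals.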