Urban Scene Diffusion through Semantic Occupancy Map

Junge Zhang; Qihang Zhang; Li Zhang; Ramana Rao Kompella; Gaowen Liu; Bolei Zhou

TL;DR

The paper addresses large-scale urban scene generation by grounding 3D content in geometry and semantics rather than appearance alone. It introduces UrbanDiffusion, a BEV-conditioned 3D diffusion model that operates in the latent space of a 3D VQ-VAE to produce semantic occupancy maps, plus a scene-extension module that enables unbounded generation by stitching consecutive local frames into a temporally consistent global scene. Quantitative and qualitative results on nuScenes BEV maps and on BEV maps derived from a driving simulator indicate good generation quality and generalization, and ablations support design choices such as injecting the BEV condition by concatenation and using a discrete VQ-VAE representation. Beyond generation, the method serves as a prior for downstream tasks such as data augmentation for point-cloud segmentation and scene image synthesis via score distillation sampling (SDS), indicating practical utility for simulation and urban scene understanding.
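To make the "concatenated BEV conditioning" mentioned above concrete, here is a minimal NumPy sketch. It assumes the 2D BEV features are repeated along the vertical axis and stacked onto the 3D latent along the channel axis before being passed to the denoiser. The function name, tensor shapes, and the tiling step are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def concat_bev_condition(latent, bev):
    """Stack a 2D BEV map onto a 3D latent along the channel axis.

    latent: array of shape (C, Z, Y, X), a hypothetical noisy 3D latent.
    bev:    array of shape (C_bev, Y, X), a hypothetical BEV feature map.
    Returns an array of shape (C + C_bev, Z, Y, X).
    """
    _, z, y, x = latent.shape
    if bev.shape[1:] != (y, x):
        raise ValueError("BEV map must match the latent's horizontal resolution")
    # Assumption: every height slice sees the same BEV layout, so the
    # 2D map is repeated along the vertical (Z) axis.
    bev_3d = np.broadcast_to(bev[:, None, :, :], (bev.shape[0], z, y, x))
    return np.concatenate([latent, bev_3d], axis=0)


rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 2, 8, 8))  # toy latent: 4 channels, 2 height slices
bev = np.zeros((3, 8, 8))                   # toy BEV map with 3 layout channels
bev[0, :, 3:5] = 1.0                        # a strip of "road" in channel 0
net_input = concat_bev_condition(latent, bev)
print(net_input.shape)  # (7, 2, 8, 8)
```

In this sketch the denoiser would simply take `net_input` in place of the bare latent, so the condition is spatially aligned with the voxels it constrains.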

Abstract

Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of a semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene to an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given BEV maps from the held-out set, and it also generalizes to synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.

Paper Structure (15 sections, 11 equations, 8 figures, 4 tables)

Introduction
Related Work
Method
Preliminary
Latent Diffusion for Semantic Occupancy Map
Scene Extension Module
Scene Image Synthesis
Experiments
BEV-conditional Generation
Quantitative Evaluation
Ablation Study
Point Cloud Segmentation
Scene Image Synthesis
Conclusion
Limitations

Figures (8)

Figure 1: Diverse individual scenes and large-scale scenes generated by UrbanDiffusion. A scene is represented by the semantic occupancy map and the color labels indicate different semantic categories. The input BEV layout is also attached as a reference.
Figure 2: Framework of UrbanDiffusion. An autoencoder with a 3D VQ-VAE architecture is trained to embed semantic occupancy maps into a latent space (top). A random latent code is gradually denoised by a BEV-conditioned denoising procedure and then decoded into a semantic occupancy map (bottom).
Figure 3: Different ways of BEV condition injection.
Figure 4: Illustration of the scene expansion. The generated sample $\mathbf{x}_t$ is projected to the next frame using the ego poses $P_t$ and $P_{t+1}$ at times $t$ and $t+1$ and the BEV maps, which yields the overlapping part. Both $\mathbf{x}_{masked}$ and the BEV map are then encoded to guide the generation of the output sample $\mathbf{x}_{t+1}$ with high temporal consistency. Finally, the sample $\mathbf{x}_{t+1}$ is merged into the global scene $G_t$ with 'keep' for the original scene, 'update' for the intersection part (by re-registering the labels of occupancy grids), and 'generate' for the new part.
Figure 5: Scenes generated from BEV maps sampled from the nuScenes validation set (a), from the Waymo Motion Dataset (b), procedurally generated by the MetaDrive simulator (c), and from the nuPlan Dataset (d). Large-scale scenes are generated from BEV maps extracted from driving logs in nuScenes (e) and the MetaDrive simulator (f). We also demonstrate diverse scenes generated from the same BEV maps in (g).
...and 3 more figures
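
The 'keep' / 'update' / 'generate' merge described in the Figure 4 caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the global scene is stored as a sparse dictionary of voxel labels, the ego-pose transform is reduced to an integer translation (`offset`), and 'update' simply overwrites the old label with the newly generated one. All names are hypothetical, and the paper's actual projection and label re-registration may differ.

```python
def merge_into_global(global_scene, new_frame, offset):
    """Merge a locally generated frame into a sparse global scene.

    global_scene: dict mapping (x, y, z) global voxel index -> semantic label.
    new_frame:    dict mapping (x, y, z) local voxel index -> semantic label.
    offset:       (dx, dy, dz) integer shift of the local frame in global
                  coordinates; a stand-in for the ego-pose transform.
    Returns the merged scene and a count of voxels per merge mode.
    """
    merged = dict(global_scene)  # start from a copy: untouched voxels are 'keep'
    counts = {"keep": 0, "update": 0, "generate": 0}
    dx, dy, dz = offset
    for (x, y, z), label in new_frame.items():
        key = (x + dx, y + dy, z + dz)
        if key in merged:
            counts["update"] += 1    # overlap: re-register the label
        else:
            counts["generate"] += 1  # newly covered region
        merged[key] = label
    counts["keep"] = len(merged) - counts["update"] - counts["generate"]
    return merged, counts


# Toy example: a 1D strip of voxels extended one step along x.
scene = {(0, 0, 0): "road", (1, 0, 0): "road", (2, 0, 0): "sidewalk"}
frame = {(0, 0, 0): "road", (1, 0, 0): "vehicle", (2, 0, 0): "building"}
merged, counts = merge_into_global(scene, frame, offset=(1, 0, 0))
print(counts)  # {'keep': 1, 'update': 2, 'generate': 1}
```

A sparse dictionary is used here only because it grows naturally as frames are appended; a dense voxel grid with boolean masks would express the same three cases.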
