
Urban Scene Diffusion through Semantic Occupancy Map

Junge Zhang, Qihang Zhang, Li Zhang, Ramana Rao Kompella, Gaowen Liu, Bolei Zhou

TL;DR

The paper addresses large-scale urban scene generation by grounding 3D content in geometry and semantics rather than appearance alone. It introduces UrbanDiffusion, a BEV-conditioned 3D diffusion model operating in a latent 3D VQVAE space to produce semantic occupancy maps, plus a scene-extension module that enables unbounded generation by stitching local frames with temporal consistency. Quantitative and qualitative analyses on nuScenes and simulator-derived BEVs demonstrate strong performance and generalization, with ablations validating design choices such as using concatenated BEV conditioning and discrete VQ-VAE representations. Beyond generation, the method serves as a prior for downstream tasks like point-cloud segmentation augmentation and scene-image synthesis via SDS, indicating practical utility for simulation and urban scene understanding.

Abstract

Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of a semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene to an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given BEV maps from the held-out set and also generalizes to synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.

Paper Structure

This paper contains 15 sections, 11 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Diverse individual scenes and large-scale scenes generated by UrbanDiffusion. A scene is represented by the semantic occupancy map and the color labels indicate different semantic categories. The input BEV layout is also attached as a reference.
  • Figure 2: Framework of UrbanDiffusion. An autoencoder with 3D VQVAE architecture is trained to embed semantic occupancy maps into a latent space (top). A random latent code is gradually diffused by a BEV-conditioned denoising procedure and then decoded into a semantic occupancy map (bottom).
  • Figure 3: Different ways of BEV condition injection.
  • Figure 4: Illustration of the scene expansion. After projecting the generated sample $\mathbf{x}_t$ to the next frame via the ego poses $P_t$ and $P_{t+1}$ at times $t$ and $t+1$ and the BEV maps, we obtain the overlapping part and encode both $\mathbf{x}_{masked}$ and the BEV map to guide the generation of the output sample $\mathbf{x}_{t+1}$ with high temporal consistency. Finally, we merge the sample $\mathbf{x}_{t+1}$ into the global scene $G_t$ with 'keep' for the original scene, 'update' for the intersection part (by re-registering the labels of occupancy grids), and 'generate' for the new part.
  • Figure 5: Scenes generated from BEV maps sampled from the nuScenes validation set (a), from the Waymo Motion Dataset (b), procedurally generated by the MetaDrive simulator (c), and from the nuPlan dataset (d). Large-scale scenes are generated from BEV maps extracted from driving logs in nuScenes (e) and the MetaDrive simulator (f). We also demonstrate diverse scenes generated from the same BEV maps in (g).
  • ...and 3 more figures
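
The 'keep' / 'update' / 'generate' merge described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes the global scene is stored sparsely as a dictionary from integer voxel coordinates to semantic labels, that the ego-pose change between frames reduces to an integer voxel translation (`offset`), and that 'update' simply overwrites the old label with the new one. The name `merge_into_global` and all labels are hypothetical.

```python
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]


def merge_into_global(
    global_scene: Dict[Voxel, int],
    local_sample: Dict[Voxel, int],
    offset: Voxel,
) -> Dict[str, int]:
    """Merge a newly generated local frame into the global scene in place.

    Assumption: the pose change is a pure integer translation `offset`;
    a real pipeline would apply a full rigid transform from the ego poses.
    Returns how many voxels fell into each of the three cases.
    """
    ox, oy, oz = offset
    # Express the local sample in global voxel coordinates.
    shifted = {
        (x + ox, y + oy, z + oz): label
        for (x, y, z), label in local_sample.items()
    }

    counts = {"keep": 0, "update": 0, "generate": 0}
    for voxel, label in shifted.items():
        if voxel in global_scene:
            counts["update"] += 1    # overlap: re-register the label
        else:
            counts["generate"] += 1  # newly covered region
        global_scene[voxel] = label

    # Voxels of the old scene that the new frame did not touch are kept.
    counts["keep"] = len(global_scene) - len(shifted)
    return counts


if __name__ == "__main__":
    scene = {(0, 0, 0): 1, (1, 0, 0): 1, (2, 0, 0): 2}
    sample = {(0, 0, 0): 3, (1, 0, 0): 2, (2, 0, 0): 4}
    print(merge_into_global(scene, sample, offset=(1, 0, 0)))
    print(sorted(scene.items()))
```

A sparse dictionary is used here only because it grows naturally as the scene is extended; a dense array with explicit bounds would work equally well for a fixed-size region.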