Urban Scene Diffusion through Semantic Occupancy Map
Junge Zhang, Qihang Zhang, Li Zhang, Ramana Rao Kompella, Gaowen Liu, Bolei Zhou
TL;DR
The paper addresses large-scale urban scene generation by grounding 3D content in geometry and semantics rather than appearance alone. It introduces UrbanDiffusion, a BEV-conditioned 3D diffusion model that operates in the latent space of a 3D VQ-VAE to produce semantic occupancy maps, plus a scene-extension module that enables unbounded generation by stitching local frames with temporal consistency. Quantitative and qualitative analyses on nuScenes and simulator-derived BEVs demonstrate strong performance and generalization, with ablations validating design choices such as concatenated BEV conditioning and discrete VQ-VAE representations. Beyond generation, the method serves as a prior for downstream tasks such as point-cloud segmentation augmentation and scene-image synthesis via SDS, indicating practical utility for simulation and urban scene understanding.
Abstract
Generating unbounded 3D scenes is crucial for large-scale scene understanding and simulation. Urban scenes, unlike natural landscapes, consist of various complex man-made objects and structures such as roads, traffic signs, vehicles, and buildings. To create a realistic and detailed urban scene, it is crucial to accurately represent the geometry and semantics of the underlying objects, going beyond their visual appearance. In this work, we propose UrbanDiffusion, a 3D diffusion model that is conditioned on a Bird's-Eye View (BEV) map and generates an urban scene with geometry and semantics in the form of a semantic occupancy map. Our model introduces a novel paradigm that learns the data distribution of scene-level structures within a latent space and further enables the expansion of the synthesized scene to an arbitrary scale. After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes given the BEV maps from the held-out set and also generalize to the synthesized maps from a driving simulator. We further demonstrate its application to scene image synthesis with a pretrained image generator as a prior.
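To make the two representations named above concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a semantic occupancy map can be stored as a voxel grid with one class label per voxel, and that a BEV map is the top-down view obtained by keeping the topmost non-empty label in each vertical column. The label ids, grid layout, and function names are hypothetical. Note that the model described here works in the opposite direction: it takes the 2D BEV map as the condition and generates the 3D grid.

```python
# Hypothetical label ids for illustration only; not the paper's label set.
EMPTY, ROAD, VEHICLE, BUILDING = 0, 1, 2, 3


def make_grid(x_size, y_size, z_size):
    """Return an empty semantic occupancy grid indexed as grid[x][y][z]."""
    return [
        [[EMPTY for _ in range(z_size)] for _ in range(y_size)]
        for _ in range(x_size)
    ]


def to_bev(grid):
    """Project a voxel grid to a top-down map.

    Each (x, y) cell receives the label of the highest non-empty voxel
    in its column, or EMPTY if the whole column is unoccupied.
    """
    bev = []
    for plane in grid:
        row = []
        for column in plane:
            label = EMPTY
            for voxel in reversed(column):  # scan from the top down
                if voxel != EMPTY:
                    label = voxel
                    break
            row.append(label)
        bev.append(row)
    return bev


# A tiny 2 x 2 x 3 scene: road with a vehicle on it, and a tall building.
grid = make_grid(2, 2, 3)
grid[0][0][0] = ROAD
grid[0][1][0] = ROAD
grid[0][1][1] = VEHICLE
for z in range(3):
    grid[1][0][z] = BUILDING

bev = to_bev(grid)
```

The projection discards height, so many different 3D scenes share the same BEV map. This is one way to see why generation conditioned on a BEV map can still yield diverse scenes.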
