Compositional Generative Model of Unbounded 4D Cities

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

TL;DR

CityDreamer4D introduces a compositional framework for unbounded 4D city generation that decouples static layouts from dynamic traffic and uses background and instance-specific neural fields within a BEV-based representation. The model comprises an Unbounded Layout Generator, a Traffic Scenario Generator, a City Background Generator, a Building Instance Generator, a Vehicle Instance Generator, and a Compositor, enabling scalable, temporally coherent city synthesis and instance-level editing. It is backed by three datasets (OSM, GoogleEarth, and CityTopia) that supply realistic layouts and high-quality city imagery with 3D annotations. Empirical results show state-of-the-art performance across 4D city metrics (FID, KID, VBench, DE, CE) and strong qualitative results, with applications in urban simulation, stylization, and navigation benchmarks. The work advances unbounded city synthesis and provides a foundation for more complex urban-AI tools and planning simulations.

Abstract

3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the GoogleEarth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.

Paper Structure (25 sections, 18 equations, 15 figures, 6 tables)

This paper contains 25 sections, 18 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Overview of CityDreamer4D. 4D city generation comprises static and dynamic scenes, conditioned on city layout $\mathbf{L}$ and time-varying traffic scenario $\mathbf{T}_t$, generated by the Unbounded Layout and Traffic Scenario Generators, respectively. City Background Generator uses $\mathbf{L}$ to create background images $\mathbf{\hat{I}}_{G}$ for stuff like roads, vegetation, and the sky, while Building Instance Generator renders the buildings $\{\mathbf{\hat{I}}_{B_i}\}$ within the city. Using $\mathbf{T}_t$, Vehicle Instance Generator generates vehicles $\{\mathbf{\hat{I}}_{V_i}^t\}$ at time step $t$. Finally, Compositor combines the rendered background, buildings, and vehicles into a unified and coherent image $\mathbf{\hat{I}}_{C}^t$. "Gen.", "Mod.", "Cond.", "BG.", "BLDG.", and "VEH." denote "Generation", "Modulation", "Condition", "Background", "Building", and "Vehicle", respectively.
  • Figure 2: Overview of the OSM and GoogleEarth Datasets. (a) Examples of the 2D and 3D annotations in the GoogleEarth dataset, which can be automatically generated using the OSM dataset. (b) The automatic annotation pipeline can be readily adapted for worldwide cities. (c) The dataset statistics highlight the diverse perspectives in the GoogleEarth dataset.
  • Figure 3: Overview of the CityTopia Dataset. (a) The virtual city generation pipeline. "Pro.Inst.", "Sur.Spl", and "3D Inst. Anno." denote "Prototype Instantiation", "Surface Sampling", and "3D Instance Annotation", respectively. (b) Examples of 2D and 3D annotations in the CityTopia dataset are shown from both daytime and nighttime street-view and aerial-view perspectives, automatically generated during virtual city generation. (c) The dataset statistics highlight the diverse perspectives in both street and aerial views.
  • Figure 4: Qualitative Comparison on Google Earth. For SceneDreamer [DBLP:journals/pami/ChenWL23] and CityDreamer4D, vehicles are generated using models trained on CityTopia due to the lack of semantic annotations for vehicles in Google Earth. For DimensionX [DBLP:preprint/arxiv/2411-04928], the initial frame is provided by CityDreamer4D. The visual results of InfiniCity [DBLP:conf/iccv/LinLMCS0T23], provided by the authors, have been zoomed in for better viewing. "Pers.Nature" stands for "PersistentNature" [DBLP:conf/cvpr/Chai0LIS23].
  • Figure 5: Qualitative Comparison on CityTopia. The initial frame for DimensionX and the input frames for DreamScene4D are chosen from the dataset. "Pers.Nature" refers to "PersistentNature" [DBLP:conf/cvpr/Chai0LIS23].
  • ...and 10 more figures
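
The composition described in the Figure 1 caption, where a static background and building renders are combined with per-time-step vehicle renders into one image, can be illustrated with a small sketch. This is not the paper's Compositor: the hard binary masks, the grayscale list-of-lists images, and the names `InstanceRender`, `composite`, and `render_frame` are all assumptions made for illustration, since the captions here do not give the actual blending rule.

```python
from dataclasses import dataclass
from typing import List, Sequence

Image = List[List[float]]  # H x W grayscale stand-in for a rendered image (assumption)
Mask = List[List[int]]     # H x W binary mask, 1 where the instance is visible (assumption)


@dataclass
class InstanceRender:
    """Hypothetical container for one rendered building or vehicle."""
    image: Image
    mask: Mask


def composite(background: Image, instances: Sequence[InstanceRender]) -> Image:
    """Paste each instance over a copy of the background wherever its mask is 1."""
    out = [row[:] for row in background]  # copy so the static background is reused
    for inst in instances:
        for y, mask_row in enumerate(inst.mask):
            for x, visible in enumerate(mask_row):
                if visible:
                    out[y][x] = inst.image[y][x]
    return out


def render_frame(background: Image,
                 buildings: Sequence[InstanceRender],
                 vehicles_at_t: Sequence[InstanceRender]) -> Image:
    """Static content (background, buildings) first, then the vehicles for step t."""
    return composite(background, list(buildings) + list(vehicles_at_t))


def vehicle_at(t: int) -> InstanceRender:
    """Toy vehicle that occupies one pixel and moves right along the bottom row."""
    mask = [[0, 0, 0], [0, 0, 0]]
    mask[1][t] = 1
    return InstanceRender(image=[[2.0] * 3 for _ in range(2)], mask=mask)


# Static scene: rendered once and shared by every frame.
background = [[0.0] * 3 for _ in range(2)]
building = InstanceRender(image=[[1.0] * 3 for _ in range(2)],
                          mask=[[1, 0, 0], [0, 0, 0]])

# Dynamic scene: only the vehicle render changes with the time step.
frames = [render_frame(background, [building], [vehicle_at(t)]) for t in range(3)]
```

In this toy setup the building pixel stays fixed across all three frames while the vehicle pixel moves, which mirrors the static/dynamic separation the captions describe. A real compositor would operate on RGB renders and could use soft masks instead.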