CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

TL;DR

CityDreamer tackles unbounded 3D city generation by decoupling background scenery from building instances and rendering both volumetrically from a bird's-eye-view (BEV) scene representation. It introduces a City Background Generator with a generative hash grid and a Building Instance Generator with pixel-level local encoders and a style code, followed by a Compositor that fuses their outputs into coherent city imagery. The approach is trained on the CityGen Datasets (OSM and GoogleEarth) to capture realistic layouts and appearances, achieving state-of-the-art results and enabling localized editing of individual buildings. This framework supports scalable, multi-view-consistent 3D city generation with practical applications in urban planning, gaming, and metaverse content creation.

Abstract

3D city generation is a desirable yet challenging task, since humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than generating 3D natural scenes, since buildings, as objects of the same class, exhibit a wider range of appearances than objects such as trees, whose appearance in natural scenes is relatively consistent. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances, and 2) background stuff, such as roads and green lands. Specifically, we adopt the bird's-eye-view scene representation and employ a volumetric renderer for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterizations to suit the distinct characteristics of building instances and background stuff. Furthermore, we contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities.
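The bird's-eye-view representation mentioned in the abstract can be illustrated with a small sketch. It assumes that a layout is given as a 2D height field plus a 2D semantic map, and that a 3D semantic volume is obtained by extruding each cell's label up to its height. The function name `extrude_layout`, the label ids, and the `[z][y][x]` indexing are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List

EMPTY = 0  # hypothetical label id for free space

Grid2D = List[List[int]]
Volume = List[List[List[int]]]


def extrude_layout(height: Grid2D, semantic: Grid2D, max_height: int) -> Volume:
    """Lift a BEV layout into a dense semantic volume indexed as [z][y][x].

    A voxel (x, y, z) receives the semantic label of BEV cell (x, y)
    when z is below that cell's height; otherwise it stays EMPTY.
    """
    rows, cols = len(height), len(height[0])
    volume = [[[EMPTY] * cols for _ in range(rows)] for _ in range(max_height)]
    for y in range(rows):
        for x in range(cols):
            top = min(height[y][x], max_height)  # clamp to the volume's extent
            for z in range(top):
                volume[z][y][x] = semantic[y][x]
    return volume


# Toy 2x2 layout: label 1 = road, label 2 = building (hypothetical ids).
height = [[1, 3], [0, 2]]
semantic = [[1, 2], [1, 2]]
volume = extrude_layout(height, semantic, max_height=4)
```

A volumetric renderer could then sample points along camera rays and look up labels or features in such a volume; that step is omitted here.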

Paper Structure

This paper contains 26 sections, 12 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: The proposed CityDreamer generates a wide variety of unbounded city layouts and multi-view consistent appearances, featuring well-defined geometries and diverse styles.
  • Figure 2: Overview of CityDreamer. The unbounded layout generator creates the city layout $\mathbf{L}$. Then, the city background generator performs ray-sampling to retrieve features from $\mathbf{L}$ and generates the background image with a volumetric renderer focusing on background stuff like roads, green lands, and water areas. Similarly, the building instance generator renders the building instance image with another volumetric renderer. Finally, the compositor merges the rendered background and building images, producing a unified and coherent final image. Note that "Mod.", "Cond.", "Bg.", and "Bldg." denote "Modulation", "Condition", "Background", and "Building", respectively.
  • Figure 3: Overview of CityGen Datasets. (a) The OSM dataset comprising paired height fields and semantic maps provides real-world city layouts. (b) The city layout, generated from the height field and semantic map, facilitates automatic annotation generation. (c) The GoogleEarth dataset includes real-world city appearances alongside semantic segmentation and building instance segmentation. (d) The dataset statistics demonstrate the variety of perspectives available in the GoogleEarth dataset.
  • Figure 4: Qualitative comparison. The proposed CityDreamer produces more realistic and diverse results compared to all baselines. Note that the visual results of InfiniCity (Lin et al., ICCV 2023) are provided by the authors and zoomed for optimal viewing.
  • Figure 5: User study on unbounded 3D city generation. All scores are on a 5-point scale, with 5 indicating the best. Note that "Pers.Nature" denotes "PersistentNature" (Chai et al., CVPR 2023).
  • ...and 10 more figures
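
The compositing step described in the Figure 2 caption, where rendered background and building images are merged into one final image, can be sketched as simple per-pixel mask compositing. This sketch assumes each building instance comes with a binary mask marking the pixels it covers; the masks, the type aliases, and the function `composite` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB values in [0, 1]
Image = List[List[Pixel]]
Mask = List[List[int]]  # 1 where the building covers the pixel, else 0


def composite(background: Image, buildings: List[Tuple[Image, Mask]]) -> Image:
    """Paste each building image over the background wherever its mask is 1.

    Later entries in `buildings` overwrite earlier ones where masks overlap.
    """
    out = [row[:] for row in background]  # copy so the input stays untouched
    for image, mask in buildings:
        for y, mask_row in enumerate(mask):
            for x, covered in enumerate(mask_row):
                if covered:
                    out[y][x] = image[y][x]
    return out


# Toy 1x3 example: green background, one red building covering the middle pixel.
GREEN: Pixel = (0.0, 1.0, 0.0)
RED: Pixel = (1.0, 0.0, 0.0)
background = [[GREEN, GREEN, GREEN]]
building = ([[RED, RED, RED]], [[0, 1, 0]])
final = composite(background, [building])
```

Because each building is pasted through its own mask, replacing a single entry in `buildings` changes only that instance's pixels, which is one way to picture the localized editing the abstract mentions.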