CityCraft: A Real Crafter for 3D City Generation

Jie Deng, Wenhao Chai, Junsheng Huang, Zhonghan Zhao, Qixuan Huang, Mingyan Gao, Jianshu Guo, Shengyu Hao, Wenhao Hu, Jenq-Neng Hwang, Xi Li, Gaoang Wang

TL;DR

CityCraft presents a novel, three-stage pipeline for scalable, controllable 3D city generation by fusing diffusion-based 2D layout synthesis, LLM-driven land-use planning, and Blender-backed asset assembly. It introduces two public datasets (CityCraft-OSM and CityCraft-Buildings) to support layout diversity and asset realism. Through extensive quantitative and qualitative experiments, CityCraft achieves state-of-the-art layout realism and multi-view scene coherence, with ablations showing the value of ratio-based conditioning and iterative planning. The work demonstrates strong practical impact for urban planning, simulation, and virtual-city visualization, enabling rich, configurable city environments. Future work includes dynamic traffic, real-time feedback, and expanded asset libraries.

Abstract

City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation, where it supports infrastructure planning and monitoring. Existing methods employ a two-stage process: city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety and closely resemble the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, a framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages. First, a diffusion transformer (DiT) model generates diverse and controllable 2D city layouts. Next, a Large Language Model (LLM) makes land-use plans within these layouts based on user prompts and language guidelines. Finally, based on the generated layout and city plan, an asset retrieval module and Blender are used for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1) the CityCraft-OSM dataset, which includes 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations; and 2) the CityCraft-Buildings dataset, which features thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.

Paper Structure (34 sections, 6 equations, 9 figures, 4 tables, 1 algorithm)

This paper contains 34 sections, 6 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of CityCraft. The Layout Generator generates realistic 2D city layouts based on user input conditions; the CityPlanner processes the generated layouts, isolates instances, extracts image-level information, makes land-use plans for the urban region, and selects appropriate assets from the asset library to craft the 3D city.
  • Figure 2: Planning and Selection Process. An example process of planning and selection. Starting from the 2D semantic layout, we isolate all instances and build a basic information dictionary for each instance based on its scale, type, spatial information, etc. (only partial information is shown in the figure for explanation; D2Road: distance to traffic roads). For each instance, we feed its information to the planner and let the planner decide on its characteristics. Then, based on these characteristics, we retrieve the best-matching candidate from the asset library and use it to craft the 3D city.
  • Figure 3: Qualitative comparison of city layouts. From top to bottom: CityCraft (ours), InfiniCity [lin2023infinicity], CityDreamer [xie2023citydreamer], and CityGen [deng2023citygen]. CityCraft shows superior detail and realism in city planning, highlighting complex road networks and diverse architectural styles.
  • Figure 4: Qualitative comparison of city scenes. From top to bottom: CityCraft (ours), InfiniCity [lin2023infinicity], CityDreamer [xie2023citydreamer], and CityGen [deng2023citygen]. CityCraft demonstrates superior architectural diversity and realism, leveraging Real 3D Crafter technology for direct building growth and LLM-driven adaptive modeling, resulting in a more authentic and varied city landscape.
  • Figure 5: City functionality distribution with different prompts. Commercial zones are mainly for business and public services, with strong public infrastructure. Residential zones focus on living spaces, supplemented by key urban functions like healthcare and education.
  • ...and 4 more figures
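
The planning and selection process described in the Figure 2 caption (isolate instances, build an information dictionary, plan, retrieve an asset) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the label ids, the toy layout, the dictionary fields other than the distance to roads, the rule-based `plan_stub` that stands in for the LLM planner, and the area-based matching in `retrieve_asset`. It is not the paper's implementation.

```python
from collections import deque
from math import hypot

# Hypothetical label ids for a toy semantic layout (not the paper's label set).
EMPTY, ROAD, BUILDING, GREEN = 0, 1, 2, 3

def isolate_instances(layout, label):
    """Return 4-connected components of `label` as lists of (row, col) cells."""
    rows, cols = len(layout), len(layout[0])
    seen, instances = set(), []
    for r in range(rows):
        for c in range(cols):
            if layout[r][c] != label or (r, c) in seen:
                continue
            queue, cells = deque([(r, c)]), []
            seen.add((r, c))
            while queue:  # breadth-first flood fill over same-label neighbours
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and layout[ny][nx] == label and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            instances.append(cells)
    return instances

def describe(cells, road_cells, kind):
    """Build a basic information dictionary for one instance."""
    ys = [y for y, _ in cells]
    xs = [x for _, x in cells]
    d2road = min((hypot(y - ry, x - rx) for y, x in cells for ry, rx in road_cells),
                 default=None)
    return {"type": kind, "area": len(cells),
            "bbox": (min(ys), min(xs), max(ys), max(xs)), "d2road": d2road}

def plan_stub(info):
    """Placeholder for the LLM planner: a fixed rule, NOT the paper's planner."""
    near_road = info["d2road"] is not None and info["d2road"] <= 2
    return {"use": "commercial" if near_road else "residential"}

def retrieve_asset(info, plan, library):
    """Pick the asset with matching use whose footprint is closest in area."""
    candidates = [a for a in library if a["use"] == plan["use"]]
    return min(candidates, key=lambda a: abs(a["footprint"] - info["area"]))

layout = [
    [2, 2, 0, 1, 0, 2],
    [2, 2, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 3],
    [2, 0, 0, 1, 0, 3],
]
# Hypothetical asset library entries.
library = [
    {"name": "office_block", "use": "commercial", "footprint": 4},
    {"name": "kiosk", "use": "commercial", "footprint": 1},
    {"name": "house", "use": "residential", "footprint": 1},
]

road_cells = [cell for inst in isolate_instances(layout, ROAD) for cell in inst]
infos = [describe(inst, road_cells, "building")
         for inst in isolate_instances(layout, BUILDING)]
chosen = [retrieve_asset(info, plan_stub(info), library) for info in infos]
```

In this sketch each building instance is planned independently; a real planner would presumably also see city-level context and the user prompt, which the stub ignores.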