CityX: Controllable Procedural Content Generation for Unbounded 3D Cities

Shougao Zhang, Mengqi Zhou, Yuxi Wang, Chuanchen Luo, Rongyu Wang, Yiwei Li, Zhaoxiang Zhang, Junran Peng

TL;DR

CityX addresses the challenge of producing diverse, controllable, and high-fidelity 3D urban scenes for embodied AI by combining procedural content generation with a universal PCG management protocol and a multi-agent, LLM-driven orchestration framework. The system translates multimodal inputs (OSM, semantic maps, satellite imagery) into executable Blender programs via four coordinating agents, with a visual feedback loop to refine results. It introduces an infinite asset library and a synthetic urban dataset rendered with Cycles, enabling scalable, simulation-ready environments. Experimental results show superior controllability, realism, and editing capabilities compared with prior works, underscoring CityX’s potential as a robust foundation for urban simulators and embodied intelligence research.

Abstract

Urban areas, as the primary human habitat in modern civilization, accommodate a broad spectrum of social activities. With the surge of embodied intelligence, recent years have witnessed an increasing presence of physical agents in urban areas, such as autonomous vehicles and delivery robots. As a result, practitioners place significant value on crafting authentic, simulation-ready 3D cities to facilitate the training and verification of such agents. However, this task is quite challenging. Current generative methods fall short in diversity, controllability, or fidelity. In this work, we resort to the procedural content generation (PCG) technique for high-fidelity generation. It assembles superior assets according to empirical rules, ultimately leading to industrial-grade outcomes. To ensure diverse and self-contained creation, we design a management protocol to accommodate extensive PCG plugins with distinct functions and interfaces. Based on this unified PCG library, we develop a multi-agent framework to transform multi-modal instructions, including OSM, semantic maps, and satellite images, into executable programs. The programs coordinate relevant plugins to construct the 3D city consistent with the control condition. A visual feedback scheme is introduced to further refine the initial outcomes. Our method, named CityX, demonstrates its superiority in creating diverse, controllable, and realistic 3D urban scenes. The synthetic scenes can be seamlessly deployed as a real-time simulator and an infinite data generator for embodied intelligence research. Our project page: https://cityx-lab.github.io.
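
As a rough illustration of the idea of a management protocol that wraps PCG plugins with distinct interfaces behind one uniform entry point, the following minimal Python sketch may help. It is not the CityX implementation: the names `PluginSpec`, `PluginRegistry`, `run_program`, and the two toy plugins are hypothetical, and the plugins return plain dictionaries rather than calling Blender.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class PluginSpec:
    """Uniform description of one PCG plugin (hypothetical schema)."""
    name: str
    description: str
    run: Callable[..., Dict[str, Any]]
    required_inputs: List[str] = field(default_factory=list)


class PluginRegistry:
    """Holds plugins behind a single call interface (illustrative only)."""

    def __init__(self) -> None:
        self._plugins: Dict[str, PluginSpec] = {}

    def register(self, spec: PluginSpec) -> None:
        if spec.name in self._plugins:
            raise ValueError(f"plugin already registered: {spec.name}")
        self._plugins[spec.name] = spec

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        spec = self._plugins[name]  # KeyError if the plugin is unknown
        missing = [k for k in spec.required_inputs if k not in kwargs]
        if missing:
            raise ValueError(f"{name} is missing inputs: {missing}")
        return spec.run(**kwargs)


def run_program(
    registry: PluginRegistry, steps: List[Tuple[str, Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Execute an ordered list of (plugin_name, kwargs) steps."""
    return [registry.call(name, **kwargs) for name, kwargs in steps]


# Toy stand-ins for real plugins; they only describe what they would build.
def road_plugin(osm_path: str) -> Dict[str, Any]:
    return {"kind": "roads", "source": osm_path}


def building_plugin(footprints: int, height: float) -> Dict[str, Any]:
    return {"kind": "buildings", "count": footprints, "height": height}


def build_demo_registry() -> PluginRegistry:
    registry = PluginRegistry()
    registry.register(
        PluginSpec("roads", "Lay out roads from an OSM file", road_plugin, ["osm_path"])
    )
    registry.register(
        PluginSpec(
            "buildings",
            "Extrude building masses from footprints",
            building_plugin,
            ["footprints", "height"],
        )
    )
    return registry
```

In this sketch, a planner (in the paper's setting, an LLM-driven agent) would emit the `steps` list, and `run_program` would dispatch each step to the matching plugin. Validating required inputs before dispatch is one simple way to surface a malformed step early, so that a feedback loop could correct it.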

Paper Structure (20 sections, 9 equations, 13 figures, 5 tables, 1 algorithm)

Figures (13)

  • Figure 1: The proposed CityX can generate high-quality, controllable, and editable large-scale urban scenes based on user descriptions and multimodal inputs (OSM files, semantic maps, satellite images). The generated scenes allow for the integration of dynamic elements, such as pedestrians and traffic, ensuring a fully interactive and adaptable environment.
  • Figure 2: Presentation of PCG modules involved in urban scene generation: building generation PCG effects (a), road generation spline-based PCG effects (b), and dynamic pedestrian and traffic flow generation PCG effects (c).
  • Figure 3: Multi-agent Workflow: Detailed demonstration of collaboration and communication across various stages.
  • Figure 4: For each image (a), we have a high-resolution mesh (b), which readily yields Depth (c), Surface Normals (d), Diffuse Map (e), Instance Segmentation masks (f), Ambient Occlusion (g), Cryptomatte Object/Material Mask (h/i), and Glossy Direct/Color/Indirect (j/k/l). We used a 1920 × 1080 resolution with 10,000 random samples per pixel, a standard setting in Blender that effectively eliminates sampling noise, resulting in a high-quality final image.
  • Figure 5: Comparative results on city generation. Issues with unreasonable geometry are observed in previous works, while our method performs well in generating realistic large-scale city scenes.
  • ...and 8 more figures