Imagine a City: CityGenAgent for Procedural 3D City Generation
Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu, Ka-Hei Hui, Haoran Xie, Bo Dai, Zhengzhe Liu
TL;DR
CityGenAgent tackles scalable 3D city generation by introducing two domain-specific languages, Block Program and Building Program, to decouple spatial layout from architectural appearance. Two agents, BlockGen and BuildingGen, are trained with Supervised Fine-Tuning followed by Reinforcement Learning using Spatial Alignment Reward and Visual Consistency Reward to achieve robust spatial reasoning, semantic fidelity, and visual coherence. The framework supports natural language editing via the program proxies and demonstrates superior semantic alignment, geometry quality, and controllability compared with existing city-generation methods. This work lays a foundation for interactive, scalable city modeling and content creation in robotics, simulation, and VR applications.
Abstract
The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT), in which we train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; and (2) Reinforcement Learning (RL), in which we design a Spatial Alignment Reward to enhance spatial reasoning ability and a Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Because generation is mediated by these programs and the trained models generalize beyond their training data, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
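The schema constraints named in the abstract (non-self-intersecting polygons and complete fields) can be illustrated with a minimal validity check. This is only a sketch under stated assumptions: the abstract does not give the Block Program schema, so the field names in `REQUIRED_FIELDS`, the dictionary representation of a block, and the helper names `is_simple_polygon` and `validate_block` are all hypothetical and are not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical required fields; the real Block Program schema is not specified in the abstract.
REQUIRED_FIELDS = ("id", "type", "polygon")


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1p2 and q1q2 properly cross (their interiors intersect).

    Simplification: touching endpoints and collinear overlaps are not detected.
    """
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_simple_polygon(vertices: Sequence[Point]) -> bool:
    """Check that no two non-adjacent edges of the closed polygon cross."""
    n = len(vertices)
    if n < 3:
        return False
    edges = [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent edges share a vertex and cannot properly cross.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _segments_cross(*edges[i], *edges[j]):
                return False
    return True


def validate_block(block: Dict) -> List[str]:
    """Return a list of schema violations; an empty list means the block passed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in block]
    polygon = block.get("polygon")
    if polygon is not None and not is_simple_polygon(polygon):
        errors.append("polygon is self-intersecting or has fewer than 3 vertices")
    return errors


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
bowtie = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]  # two edges cross at (0.5, 0.5)

ok_errors = validate_block({"id": "b0", "type": "residential", "polygon": square})
bad_errors = validate_block({"id": "b1", "polygon": bowtie})
```

A check of this kind could serve to filter supervised training examples or to reject malformed generations before rendering; whether and how the authors apply such a check is not stated in the abstract.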
