MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo, Junyan Ye, Weijia Li, Yiping Chen, Ting Han
TL;DR
MajutsuCity presents a language-driven, aesthetically adaptive framework for scalable and controllable 3D city generation, featuring a four-stage pipeline that yields layout/height maps, assets, and materials, plus an interactive editing agent and a high-quality multimodal dataset. It combines LongCLIP- and ControlNet-based layout/height synthesis with bottom-up asset and material generation, culminating in a renderable city assembled from semantically aligned layers. A novel editing agent enables object-level Add/Delete/Edit/Move/Replace operations, facilitating iterative refinement, while a VLM-based evaluation framework (AQS and RDR) provides robust, multi-dimensional benchmarking. The approach achieves state-of-the-art results in geometry fidelity, stylistic adaptability, and semantic controllability, backed by extensive datasets and metrics that can spur future research and practical workflows in large-scale, text-guided 3D city synthesis.
Abstract
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must combine stylistic diversity, fine granularity, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. In addition, we develop a practical set of evaluation metrics covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate that MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results establish MajutsuCity as a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect that our framework will inspire new avenues of research in 3D city generation. Project page: https://longhz140516.github.io/MajutsuCity/.
