Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li
TL;DR
Yo'City introduces a novel agentic framework for personalized, boundless 3D city generation that leverages a City–District–Grid hierarchy and a produce–refine–evaluate loop to create scalable, high-fidelity urban scenes. By coupling a Global Planner and Local Designer with an isometric image-to-3D pipeline and a scene-graph–driven Expansion Module, the method achieves both global structural coherence and local architectural richness. A Retrieval-Augmented Grounding (RAG) approach grounds planning in real-world city patterns, while parallel tile generation enables rapid, large-scale city synthesis without map data. The authors provide a multi-dimensional evaluation benchmark and demonstrate state-of-the-art performance across semantic alignment, geometric fidelity, texture clarity, layout coherence, scene coverage, and realism, underscoring the practical potential for immersive VR, digital twins, and game-like applications.
Abstract
Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
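
As a rough illustration of the control flow described above, the following Python sketch models the "City-District-Grid" hierarchy as plain data classes and the "produce-refine-evaluate" loop as a generic function. It is not the authors' implementation. The class names, the `produce`, `refine`, and `evaluate` callables, the `threshold` and `max_rounds` parameters, and the keep-the-best-candidate policy are all assumptions introduced here for illustration; in the paper these roles are filled by off-the-shelf large models.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Grid:
    # Grid-level text description (the Local Designer's output in the paper).
    description: str


@dataclass
class District:
    # Functional role of the district, e.g. "residential" (hypothetical label).
    function: str
    grids: List[Grid] = field(default_factory=list)


@dataclass
class City:
    # User prompt that the whole hierarchy is planned from.
    prompt: str
    districts: List[District] = field(default_factory=list)


def produce_refine_evaluate(
    description: str,
    produce: Callable[[str], Any],
    refine: Callable[[Any, str], Any],
    evaluate: Callable[[Any, str], float],
    threshold: float = 0.8,   # assumed acceptance score, not from the paper
    max_rounds: int = 3,      # assumed refinement budget, not from the paper
) -> Tuple[Any, float]:
    """Produce a candidate image, then refine it until the evaluator's score
    reaches the threshold or the round budget is spent. Returns the best
    candidate seen and its score."""
    best_image = produce(description)
    best_score = evaluate(best_image, description)
    rounds = 0
    while best_score < threshold and rounds < max_rounds:
        candidate = refine(best_image, description)
        score = evaluate(candidate, description)
        # Keep a refinement only if the evaluator prefers it.
        if score > best_score:
            best_image, best_score = candidate, score
        rounds += 1
    return best_image, best_score
```

In this sketch each `Grid` description would be passed through `produce_refine_evaluate` independently, and the accepted isometric image would then be handed to a separate image-to-3D step, which is omitted here.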
