Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Keyang Lu, Sifan Zhou, Hongbin Xu, Gang Xu, Zhifei Yang, Yikai Wang, Zhen Xiao, Jieyi Long, Ming Li

TL;DR

Yo'City introduces a novel agentic framework for personalized, boundless 3D city generation that leverages a City–District–Grid hierarchy and a produce–refine–evaluate loop to create scalable, high-fidelity urban scenes. By coupling a Global Planner and Local Designer with an isometric image-to-3D pipeline and a scene-graph–driven Expansion Module, the method achieves both global structural coherence and local architectural richness. A Retrieval-Augmented Grounding (RAG) approach grounds planning in real-world city patterns, while parallel tile generation enables rapid, large-scale city synthesis without map data. The authors provide a multi-dimensional evaluation benchmark and demonstrate state-of-the-art performance across semantic alignment, geometric fidelity, texture clarity, layout coherence, scene coverage, and realism, underscoring the practical potential for immersive VR, digital twins, and game-like applications.

Abstract

Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.
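
As a rough illustration of the "City-District-Grid" hierarchy and the "produce-refine-evaluate" loop named in the abstract, the following Python sketch may help. It is not the authors' implementation: the class names, the `produce`, `refine`, and `evaluate` callables, the score threshold, and the round limit are all hypothetical stand-ins for whatever models and stopping rule the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Grid:
    """One grid cell with the textual description written by the Local Designer."""
    row: int
    col: int
    description: str


@dataclass
class District:
    """A functional district proposed by the Global Planner."""
    name: str
    grids: List[Grid] = field(default_factory=list)


@dataclass
class City:
    """Top level of the hierarchy, created from the user prompt."""
    prompt: str
    districts: List[District] = field(default_factory=list)


def produce_refine_evaluate(
    description: str,
    produce: Callable[[str], str],          # hypothetical: text -> image handle
    refine: Callable[[str, str], str],      # hypothetical: (image, text) -> image
    evaluate: Callable[[str, str], float],  # hypothetical: (image, text) -> score
    threshold: float = 0.8,                 # assumed acceptance score
    max_rounds: int = 3,                    # assumed budget of image attempts
) -> Tuple[str, float]:
    """Return the best-scoring image found within max_rounds attempts."""
    image = produce(description)
    best_image, best_score = image, evaluate(image, description)
    for _ in range(max_rounds - 1):
        if best_score >= threshold:
            break  # good enough, stop refining
        image = refine(best_image, description)
        score = evaluate(image, description)
        if score > best_score:
            best_image, best_score = image, score
    return best_image, best_score
```

In this sketch the loop would run once per `Grid`, and the accepted isometric image would then be handed to a separate image-to-3D step, which is omitted here.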

Paper Structure

This paper contains 31 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: A vast city generated by Yo'City. It incorporates key elements of a modern metropolis while also featuring more personalized designs, such as a Harry Potter–themed park and a minimalist shopping mall. The zoomed-in views of them are provided on the right.
  • Figure 2: Overview of Yo'City. Global Planner: Converts the user prompt into a coarse city layout. Local Designer: Refines the layout into detailed, per-grid textual descriptions. 3D Generator: Synthesizes 3D assets for each grid by lifting isometric images. Expansion Module: Determines the content and optimal placement for new grids to evolve the city. Finally, all generated 3D assets are assembled into the complete city scene.
  • Figure 3: Qualitative comparison between our method and the baselines given the same city instructions. The red boxes highlight regions in SynCity that exhibit spatial inconsistency, lack of realism, and poor texture fidelity. We additionally provide zoom-in visualizations for Yo'City, demonstrating clearer structural coherence and finer visual details. More cases are shown in the supplementary material.
  • Figure 4: Visualization of expansion. The first row presents the city’s global instruction, expressed as a set of keywords. The leftmost city shows the initial generation result, followed by five successive expansion iterations. In the top-left corner, a BEV thumbnail depicts the city layout, with blue regions indicating newly expanded grids, while red boxes in the rendered images highlight their appearances.
  • Figure 5: VQAScore variations across expansion steps. The figure shows results for five cities, each undergoing four expansion steps, to demonstrate the stability of the expansion mechanism.
  • ...and 5 more figures
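
The Expansion Module in Figure 2 is described as choosing the content and optimal placement for new grids, using distance- and semantics-aware layout optimization. The sketch below shows one simple way such a placement step could look. It is an assumption-laden toy, not the paper's objective: the `affinity` weights (positive for "should be near", negative for "should be far"), the Manhattan distance, the edge-adjacent frontier, and the lexicographic tie-break are all invented for illustration.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


def frontier(occupied: Set[Cell]) -> Set[Cell]:
    """Empty cells that share an edge with at least one occupied cell."""
    out: Set[Cell] = set()
    for r, c in occupied:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cell = (r + dr, c + dc)
            if cell not in occupied:
                out.add(cell)
    return out


def placement_cost(cell: Cell, layout: Dict[Cell, str],
                   affinity: Dict[str, float]) -> float:
    """Sum of affinity-weighted Manhattan distances to existing grids.

    Labels missing from `affinity` contribute nothing (weight 0).
    """
    return sum(
        affinity.get(label, 0.0) * (abs(cell[0] - r) + abs(cell[1] - c))
        for (r, c), label in layout.items()
    )


def best_placement(layout: Dict[Cell, str],
                   affinity: Dict[str, float]) -> Cell:
    """Pick the frontier cell with the lowest cost; ties go to the smaller cell."""
    return min(
        frontier(set(layout)),
        key=lambda cell: (placement_cost(cell, layout, affinity), cell),
    )


# Hypothetical example: place a playground near the school and housing,
# away from the factory.
layout = {(0, 0): "school", (1, 0): "residential", (0, 1): "factory"}
affinity = {"school": 1.0, "residential": 1.0, "factory": -1.0}
chosen = best_placement(layout, affinity)
```

A real system would presumably derive the relation weights from a scene graph rather than a hand-written dictionary; the dictionary here only keeps the example self-contained.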