UrbanWorld: An Urban World Model for 3D City Generation

Yu Shang, Yuming Lin, Yu Zheng, Hangyu Fan, Jingtao Ding, Jie Feng, Jiansheng Chen, Li Tian, Yong Li

TL;DR

UrbanWorld presents a fully automatic urban world model that generates realistic, interactive 3D urban environments under flexible control. It couples map-based layout generation, a specialized Urban MLLM for scene design, diffusion-based texture rendering with depth-aware control, and MLLM-guided refinement to iteratively improve outputs. Quantitative analyses across five visual metrics indicate state-of-the-art realism, while qualitative results demonstrate broad controllability with text or image prompts and clear interactive capabilities for embodied agents. The work also provides an open-source toolset to foster research in embodied AI and AGI within urban contexts.

Abstract

Cities, as the essential environment of human life, encompass diverse physical elements such as buildings, roads and vegetation, which continuously interact with dynamic entities like people and vehicles. Crafting realistic, interactive 3D urban environments is essential for nurturing AGI systems and constructing AI agents capable of perceiving, making decisions, and acting like humans in real-world environments. However, creating high-fidelity 3D urban environments usually entails extensive manual labor from designers, involving intricate detailing and representation of complex urban elements. Accomplishing this automatically therefore remains a longstanding challenge. To address this problem, we propose UrbanWorld, the first generative urban world model that can automatically create a customized, realistic and interactive 3D urban world with flexible control conditions. UrbanWorld incorporates four key stages in the generation pipeline: flexible 3D layout generation from OpenStreetMap (OSM) data or urban layouts with semantic and height maps, urban scene design with Urban MLLM, controllable urban asset rendering via progressive 3D diffusion, and MLLM-assisted scene refinement. We conduct extensive quantitative analysis on five visual metrics, demonstrating that UrbanWorld achieves state-of-the-art (SOTA) generation realism. Next, we provide qualitative results on the controllable generation capabilities of UrbanWorld using both textual and image-based prompts. Lastly, we verify the interactive nature of these environments by showcasing agent perception and navigation within the created environments. We contribute UrbanWorld as an open-source tool available at https://github.com/Urban-World/UrbanWorld.
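
The four stages named in the abstract run in sequence, each consuming the output of the previous one. The sketch below only illustrates that ordering and is not taken from the UrbanWorld repository: the `Scene` container, the stage function names, and `run_pipeline` are all hypothetical, and each stage body is a stub that records its name where a real system would call a layout builder, an MLLM, or a diffusion model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scene:
    """Hypothetical state passed between stages (not an UrbanWorld type)."""
    layout_source: str            # e.g. "osm" or "semantic+height maps"
    prompt: str                   # textual control condition
    history: List[str] = field(default_factory=list)


Stage = Callable[[Scene], Scene]


def generate_layout(scene: Scene) -> Scene:
    # Stage 1: build an untextured 3D layout from the layout source (stub).
    scene.history.append("layout")
    return scene


def design_scene(scene: Scene) -> Scene:
    # Stage 2: an Urban MLLM would plan the scene from the prompt (stub).
    scene.history.append("design")
    return scene


def render_assets(scene: Scene) -> Scene:
    # Stage 3: diffusion-based texture rendering of each asset (stub).
    scene.history.append("render")
    return scene


def refine_scene(scene: Scene) -> Scene:
    # Stage 4: MLLM-assisted refinement of the textured scene (stub).
    scene.history.append("refine")
    return scene


# Assumed stage order, following the order given in the abstract.
PIPELINE: List[Stage] = [generate_layout, design_scene, render_assets, refine_scene]


def run_pipeline(scene: Scene, stages: List[Stage] = PIPELINE) -> Scene:
    """Apply each stage in order, threading the scene through."""
    for stage in stages:
        scene = stage(scene)
    return scene


if __name__ == "__main__":
    result = run_pipeline(Scene(layout_source="osm", prompt="a modern business district"))
    print(result.history)
```

Keeping the stages as plain functions in a list makes it easy to swap one out or to rerun the refinement step on its own, which suits a pipeline whose last stage is described as iterative.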

Paper Structure

This paper contains 18 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of the whole framework of UrbanWorld, including four key components: (A) Flexible 3D urban layout generation; (B) Urban MLLM-empowered scene design; (C) Diffusion-based urban asset texture rendering; (D) MLLM-assisted scene refinement.
  • Figure 2: Illustration of the urban asset rendering method in UrbanWorld, mainly including two stages: depth-aware UV texture generation with flexible control under textual and visual prompts and UV position-aware texture refinement.
  • Figure 3: Illustration of the evolution of created urban environments, including the untextured urban scene, initial textured urban scene and refined urban scene.
  • Figure 4: Qualitative comparisons of generated 3D urban environments from Infinicity, CityGen, CityDreamer and UrbanWorld. By comparison, our method can craft more diverse and realistic 3D urban environments enabling dynamic interactions with humans (walking) and vehicles (driving).
  • Figure 5: Illustration of the controllable generation of diverse architecture styles when prompting with reference images (upper left) in UrbanWorld.
  • ...and 3 more figures