MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility

Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou

TL;DR

MetaUrban addresses the need for scalable, safe AI in urban micromobility with a compositional simulator that can generate infinite urban scenes with rich semantics, diverse agents, and realistic dynamics. Its three core modules (Hierarchical Layout Generation, Scalable Obstacle Retrieval, and Cohabitant Populating), together with the MetaUrban-12K dataset, provide a platform for evaluating PointNav and SocialNav under varied geometries, terrains, and pedestrian interactions. The results show strong generalization to unseen environments and highlight the trade-off between safety and performance, while cross-machine analyses reveal how mechanical design shapes policy learning. By open-sourcing the platform and dataset, MetaUrban aims to accelerate research on safe, trustworthy embodied AI for urban micromobility and to inform future urban planning and robotics deployment.

Abstract

Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI mean that public urban spaces are no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while robot dogs and humanoids have recently emerged in the street. Micromobility enabled by AI for short-distance travel in public urban spaces is a crucial component of the future transportation system. Ensuring the generalizability and safety of AI models maneuvering mobile machines is essential. In this work, we present MetaUrban, a compositional simulation platform for AI-driven urban micromobility research. MetaUrban can construct an infinite number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as a pilot study using MetaUrban for urban micromobility research and establish various baselines of Reinforcement Learning and Imitation Learning. We conduct extensive evaluation across mobile machines, demonstrating that heterogeneous mechanical structures significantly influence the learning and execution of AI policies. We perform a thorough ablation study, showing that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban, including its code and dataset, will be made publicly available to provide research opportunities and foster safe and trustworthy embodied AI and micromobility in cities.

Paper Structure (53 sections, 2 equations, 26 figures, 10 tables)


Figures (26)

  • Figure 1: MetaUrban enables the construction of infinite interactive urban scenes, supports multiple sensors, and offers flexible user interfaces such as a mouse, keyboard, joystick, and racing wheel. The platform includes 10,000 diverse obstacles in urban scenes, 1,100 rigged human models each with 2,314 movements, vulnerable road users, mobile machines with varied mechanical structures, and a terrain generation system to create complex ground conditions. We highly recommend visiting our project page for video demonstrations.
  • Figure 2: Motivation. (Top) Emerging automated micromobility. (Bottom) Unique challenges in micromobility.
  • Figure 3: Procedural generation. MetaUrban can automatically generate complex urban scenes thanks to its compositional nature. From the second to the fourth column, the top row shows the 2D road maps, and the bottom row shows the bird's-eye view of the 3D scenes.
  • Figure 4: Ground plan. (Left) The sidewalk is divided into four functional zones: building, frontage, clear, and furnishing. (Right) Seven typical sidewalk templates, from (a) to (g).
  • Figure 5: Scalable obstacle retrieval. (a) Real-world distribution extraction. We get the object distribution for urban spaces from three sources: academic datasets, Google Street data, and text description data. (b) Open-vocabulary search. We use the VLM to get image and text embeddings. Then, based on the relevance scores, we retrieve the highest-ranked objects.
  • ...and 21 more figures
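
The open-vocabulary search described in the Figure 5 caption (embed images and text, score relevance, keep the highest-ranked objects) can be illustrated with a minimal sketch. This is not MetaUrban's implementation: the function names, the toy three-dimensional embeddings, and the use of cosine similarity as the relevance score are all assumptions made for illustration. In practice the embeddings would come from a vision-language model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_objects(text_embedding, image_embeddings, top_k=3):
    """Score every candidate object against the text query and return the top_k.

    image_embeddings maps a (hypothetical) object name to its image embedding.
    """
    scored = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for VLM outputs (assumed values, not real data).
catalog = {
    "fire_hydrant": [0.9, 0.1, 0.0],
    "bench": [0.1, 0.9, 0.1],
    "trash_bin": [0.2, 0.3, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of a text description

top = rank_objects(query, catalog, top_k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Ranking by a normalized score such as cosine similarity keeps the comparison independent of embedding magnitude, which is one common choice when text and image embeddings share a joint space.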