Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng

TL;DR

The paper addresses the challenge of reproducible benchmarking for open-world embodied AI, showing that video-based 3DGS methods suffer from weak geometric grounding and unreliable view synthesis. It introduces Wanderland, a real-to-sim framework that fuses multi-sensor capture with LIV-SLAM reconstruction and 3D Gaussian Splatting, yielding metric-scale geometry and photorealistic rendering integrated into USD scenes for Isaac Sim. The authors provide the Wanderland dataset, a large-scale indoor–outdoor urban dataset with rich sensor data and navigation benchmarks, and demonstrate that geometric grounding improves novel-view synthesis and navigation policy reliability while vision-only pipelines lag behind. By establishing a robust, geometry-grounded simulation platform and dataset, Wanderland offers a foundation for reproducible open-world embodied AI research and benchmarking across perception, planning, and navigation tasks.

Abstract

Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.

Paper Structure

This paper contains 24 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Do video-3DGS frameworks provide geometrically grounded and photorealistic simulation? We demonstrate that building such simulations from casually captured touring videos often fails due to limited view diversity, inaccurate 3D reconstruction, unreliable geometry extraction, and degraded novel-view extrapolation. We propose the Wanderland framework that features multi-sensor diverse-view capture, reliable reconstruction, accurate metric-scale geometry, and robust view synthesis.
  • Figure 2: The MetaCam device. (a) SkylandX MetaCam Air used for data collection, equipped with a companion app for mobile capture. (b) Working frequency of each sensor.
  • Figure 3: Data collection trajectory. To facilitate both diverse-view capture and evaluation of extrapolated views in navigation, our data is collected with well-defined training and extrapolation splits. Training views ensure accurate reconstruction, while extrapolation views are used for evaluation.
  • Figure 4: Data Processing Pipeline. Our pipeline begins with multi-sensor capture using the MetaCam device in real-world urban spaces. MetaCam Studio processes the raw data via LIV-SLAM to produce a colorized, globally consistent metric point cloud and accurate camera poses. We then initialize 3D Gaussians from the metric point cloud and render per-view depth maps from this initialization. The 3DGS model is optimized with both photometric and depth losses. In parallel, we extract a reliable collision mesh from the same global point cloud. Finally, we integrate the trained 3DGS model and the collision mesh into a single Universal Scene Description (USD) scene, which can be directly loaded into Isaac Sim for training and evaluating navigation policies.
  • Figure 5: Mesh qualitative comparison. All results are reconstructed from the same data in the Wanderland dataset. Our framework extracts a complete and smooth mesh.
  • ...and 6 more figures
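
Two steps from the Figure 4 caption, rendering per-view depth from the metric point cloud and optimizing with photometric plus depth losses, can be illustrated with a rough NumPy sketch. This is not the authors' implementation. It assumes a pinhole camera, replaces Gaussian depth rendering with a simple point z-buffer, and uses plain L1 terms with a hypothetical `depth_weight`; the function names `project_depth` and `combined_loss` are likewise made up for illustration.

```python
import numpy as np


def project_depth(points_world, K, T_cam_world, height, width):
    """Z-buffer a metric point cloud into a sparse depth map (0 = no data).

    Assumption: pinhole intrinsics K (3x3) and a 4x4 world-to-camera
    transform. A real pipeline would render depth from the Gaussians.
    """
    R, t = T_cam_world[:3, :3], T_cam_world[:3, 3]
    pts_cam = points_world @ R.T + t          # world -> camera frame
    z = pts_cam[:, 2]
    in_front = z > 1e-6                       # drop points behind the camera
    pts_cam, z = pts_cam[in_front], z[in_front]

    uv = pts_cam @ K.T                        # homogeneous pixel coordinates
    u = np.floor(uv[:, 0] / z).astype(int)
    v = np.floor(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)           # keep the nearest point per pixel
    depth[np.isinf(depth)] = 0.0              # mark empty pixels as invalid
    return depth


def combined_loss(rendered_rgb, target_rgb, rendered_depth, prior_depth,
                  depth_weight=0.1):
    """L1 photometric loss plus L1 depth loss on pixels with a depth prior.

    Assumption: depth_weight and the L1 form are illustrative choices only.
    """
    photometric = np.abs(rendered_rgb - target_rgb).mean()
    mask = prior_depth > 0
    depth_term = np.abs(rendered_depth - prior_depth)[mask].mean() if mask.any() else 0.0
    return photometric + depth_weight * depth_term


# Toy usage: two points on the optical axis and one behind the camera.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 5.0], [0.0, 0.0, -1.0]])
prior = project_depth(points, K, np.eye(4), height=5, width=5)
loss = combined_loss(np.zeros((5, 5, 3)), np.full((5, 5, 3), 0.5),
                     np.full((5, 5), 3.0), prior)
```

Masking the depth term to pixels that have a prior keeps empty regions of the sparse depth map from pulling the rendered depth toward zero.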