Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation

Ziyang Xie, Zhizheng Liu, Zhenghao Peng, Wayne Wu, Bolei Zhou

TL;DR

Vid2Sim tackles the persistent sim-to-real gap in urban navigation by turning monocular videos into photorealistic, physically interactive digital twins. It combines geometry-consistent Gaussian Splatting (GS) reconstruction with a hybrid GS+mesh representation, which supplies realistic observations and robust physical interactions for RL training, and uses screen-space covariance culling to preserve visual fidelity. Compared with agents trained with prior simulation methods, it improves navigation success rate by 31.2% in digital twins and by 68.3% in the real world, with zero-shot sim-to-real transfer in real-world deployments, and it outperforms traditional mesh-based pipelines. By supporting diverse real-world scenes, weather effects, and dynamic obstacles, Vid2Sim offers a practical, scalable way to train generalizable urban navigation policies with a reduced sim-to-real gap.
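The summary above names screen-space covariance culling but does not state the criterion. The sketch below is one plausible reading, not the authors' implementation: project each 3D Gaussian's covariance to the image plane with the usual first-order (EWA-style) perspective Jacobian, then drop Gaussians whose projected footprint exceeds a pixel threshold. All function names, the 3-sigma radius, the near-plane value, and the threshold are assumptions made for illustration.

```python
import numpy as np

def project_covariance(cov_world, mean_world, world_to_cam, fx, fy):
    """Return the 2x2 screen-space covariance of one 3D Gaussian.

    Assumption: pinhole camera, first-order (EWA-style) approximation.
    """
    R = world_to_cam[:3, :3]
    t = world_to_cam[:3, 3]
    x, y, z = R @ mean_world + t
    # Jacobian of u = fx * x / z, v = fy * y / z evaluated at the Gaussian mean.
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    cov_cam = R @ cov_world @ R.T
    return J @ cov_cam @ J.T

def screen_space_radius(cov2d, n_sigma=3.0):
    """Approximate footprint radius in pixels from the largest eigenvalue."""
    lam_max = np.linalg.eigvalsh(cov2d)[-1]  # eigenvalues come back ascending
    return n_sigma * np.sqrt(max(lam_max, 0.0))

def cull_mask(means, covs, world_to_cam, fx, fy, max_radius_px, near=0.1):
    """Boolean mask: True for Gaussians kept, False for culled ones.

    Hypothetical rule: cull anything behind the near plane or with a
    screen-space radius larger than max_radius_px.
    """
    R = world_to_cam[:3, :3]
    t = world_to_cam[:3, 3]
    keep = np.zeros(len(means), dtype=bool)
    for i, (mean, cov) in enumerate(zip(means, covs)):
        depth = (R @ mean + t)[2]
        if depth <= near:
            continue  # behind or too close to the camera
        cov2d = project_covariance(cov, mean, world_to_cam, fx, fy)
        keep[i] = screen_space_radius(cov2d) <= max_radius_px
    return keep

if __name__ == "__main__":
    means = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 0.5], [0.0, 0.0, -1.0]])
    covs = np.stack([0.01 * np.eye(3)] * 3)
    print(cull_mask(means, covs, np.eye(4), fx=100.0, fy=100.0, max_radius_px=30.0))
```

In this toy setup the Gaussian close to the camera projects to a large footprint and is culled, which is the kind of oversized near-camera splat that would otherwise smear across the agent's view. A production renderer would apply the same test in a vectorized or GPU pass rather than a Python loop.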

Abstract

The sim-to-real gap has long posed a significant challenge for robot learning in simulation, preventing the deployment of learned models in the real world. Previous work has primarily focused on domain randomization and system identification to mitigate this gap. However, these methods are often limited by the inherent constraints of the simulation and graphics engines. In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim2real gap through a scalable and cost-efficient real2sim pipeline for neural 3D scene reconstruction and simulation. Given a monocular video as input, Vid2Sim can generate photorealistic and physically interactable 3D simulation environments to enable the reinforcement learning of visual navigation agents in complex urban environments. Extensive experiments demonstrate that Vid2Sim significantly improves the success rate of urban navigation in the digital twins and the real world by 31.2% and 68.3%, respectively, compared with agents trained with prior simulation methods.

Paper Structure

This paper contains 22 sections, 9 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Vid2Sim converts monocular video captured by a hand-held camera into realistic and interactive 3D simulation environments. It facilitates RL training of navigation agents in digital twins of urban scenes and provides realistic observations like RGB and depth to reduce the sim-to-real gap. The pink mobile robot in the image is a food delivery bot that avoids collisions with pedestrians and obstacles.
  • Figure 2: Vid2Sim framework consists of three key stages: (1) Geometry-consistent reconstruction for high-quality environment creation, (2) building a realistic and interactive simulation with hybrid scene representation and diverse obstacle and scene augmentation for visual navigation training, and (3) Sim2Real validation through real-world deployment.
  • Figure 3: Interactive Scene Composition with Vid2Sim: Our method is able to combine reconstructed environments with 3D assets to create diverse simulation scenarios. Here we show the (a) original real2sim environment, (b) interactive scene composition with static and dynamic obstacles, (c) scene mesh for physical collision detection, (d) agent's RGB observations, and (e) depth rendering from our hybrid scene representation that could serve as an extra sensory modality.
  • Figure 4: Scene augmentation with various overall stylizations.
  • Figure 5: Surface normal renderings of different methods: results show that our approach reconstructs scene geometry with finer surface details and fewer artifacts compared to 3DGS [kerbl3Dgaussians], Video2Game [xia2024video2game], and 2DGS [Huang2DGS2024].
  • ...and 7 more figures