NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation

Jiahang Liu, Yuanxing Duan, Jiazhao Zhang, Minghan Li, Shaoan Wang, Zhizheng Zhang, He Wang

Abstract

Simulating realistic environments for robots is widely recognized as a critical challenge in robot learning, particularly in terms of rendering and physical simulation. This challenge becomes even more pronounced in navigation tasks, where trajectories often extend across multiple rooms or entire floors. In this work, we present NavGSim, a Gaussian Splatting-based simulator designed to generate high-fidelity, large-scale navigation environments. Built upon a hierarchical 3D Gaussian Splatting framework, NavGSim enables photorealistic rendering in expansive scenes spanning hundreds of square meters. To simulate navigation collisions, we introduce a Gaussian Splatting-based slice technique that directly extracts navigable areas from reconstructed Gaussians. Additionally, for ease of use, we provide comprehensive NavGSim APIs supporting multi-GPU development, including tools for custom scene reconstruction, robot configuration, policy training, and evaluation. To evaluate NavGSim's effectiveness, we train a Vision-Language-Action (VLA) model using trajectories collected from NavGSim and assess its performance in both simulated and real-world environments. Our results demonstrate that NavGSim significantly enhances the VLA model's scene understanding, enabling the policy to handle diverse navigation queries effectively.

Paper Structure (19 sections, 18 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: NavGSim's agent-simulator architecture. Users interact with the system through a high-level Python API, which provides access to the core functionalities of NavGSim.
  • Figure 2: NavGSim's sample code
  • Figure 3: Illustration of Gaussian Slicing for Navigable Areas. The left image shows a 3D sofa model projected onto multiple horizontal Z-planes ($Z_i$, $Z_{i+1}$, $Z_{i+2}$) at fixed height intervals. Each plane contains a 2D Gaussian slice that represents the projection of the sofa at that height. These slices are combined to form a 2D collision map. The right image shows the 2D occupancy map of the entire scene, which serves as a geometric prior for downstream tasks such as navigation cost-map construction or collision checking.
  • Figure 4: VLA-NavGSim architecture. Our VLA model takes multiple frames of images and language instructions as input. The images are processed by a Vision Encoder to extract visual features. The resulting visual tokens, along with the language tokens, are fed into the LLM to generate a predicted token. This token is then passed through a lightweight MLP to generate a trajectory.
  • Figure 5: Visualisation of reconstructed Gaussian Splatting and rendering examples. (Left) We illustrate the process of subdividing the scene into spatial chunks using H3DGS for efficient reconstruction. (Right) We compare rendering results of NavGSim with real-world images at seven distinct viewpoints (A-F).
  • ...and 4 more figures
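
The slicing idea in the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not NavGSim's implementation: the function names (`slice_gaussian`, `occupancy_map`), the use of standard Gaussian conditioning to obtain each 2D slice, the max-over-heights merge, and the fixed density threshold are all assumptions introduced here.

```python
import numpy as np


def slice_gaussian(mean, cov, z0):
    """Restrict one 3D Gaussian to the plane z = z0.

    Uses standard Gaussian conditioning (an assumption of this sketch):
    returns the 2D mean, the 2D covariance, and a weight that decays as
    the plane moves away from the Gaussian's centre.
    """
    mu_xy, mu_z = mean[:2], mean[2]
    s_xy = cov[:2, :2]          # xy block of the covariance
    s_xz = cov[:2, 2]           # cross terms between xy and z
    s_zz = cov[2, 2]            # variance along z
    dz = z0 - mu_z
    mean2d = mu_xy + s_xz * (dz / s_zz)
    cov2d = s_xy - np.outer(s_xz, s_xz) / s_zz
    weight = np.exp(-0.5 * dz * dz / s_zz)
    return mean2d, cov2d, weight


def occupancy_map(means, covs, opacities, z_planes,
                  x_range, y_range, resolution, threshold=0.5):
    """Combine slices at several heights into one boolean 2D grid.

    A cell is marked occupied when the accumulated slice density at any
    of the given heights reaches `threshold` (hypothetical parameter).
    """
    xs = np.arange(x_range[0], x_range[1], resolution) + 0.5 * resolution
    ys = np.arange(y_range[0], y_range[1], resolution) + 0.5 * resolution
    gx, gy = np.meshgrid(xs, ys)            # cell centres, shape (H, W)
    pts = np.stack([gx, gy], axis=-1)       # shape (H, W, 2)

    density = np.zeros(gx.shape)
    for z0 in z_planes:
        layer = np.zeros(gx.shape)
        for mean, cov, alpha in zip(means, covs, opacities):
            m2, c2, w = slice_gaussian(mean, cov, z0)
            d = pts - m2
            inv = np.linalg.inv(c2)
            maha = np.einsum("hwi,ij,hwj->hw", d, inv, d)
            layer += alpha * w * np.exp(-0.5 * maha)
        # Keep the strongest response over all height planes.
        density = np.maximum(density, layer)
    return density >= threshold


if __name__ == "__main__":
    # One isotropic blob standing in for a piece of furniture.
    means = np.array([[1.0, 1.0, 0.5]])
    covs = np.array([0.04 * np.eye(3)])
    opacities = np.array([1.0])
    grid = occupancy_map(means, covs, opacities,
                         z_planes=[0.25, 0.5, 0.75],
                         x_range=(0.0, 2.0), y_range=(0.0, 2.0),
                         resolution=0.25)
    print(grid.astype(int))
```

Restricting the height planes to the robot's body extent means that only geometry the robot could actually hit contributes to the map; the looped rasterisation here is chosen for readability, and a practical version would batch the Gaussians.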