Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains

Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, Jiangmiao Pang

TL;DR

Gallant introduces a voxel-grid perception framework for humanoid locomotion in 3D constrained terrains, addressing the limitations of depth and elevation maps by preserving multi-layer scene structure. The approach uses a robot-centric voxel grid derived from LiDAR, processed with a z-grouped 2D CNN to produce perceptual features that feed an end-to-end PPO-based policy, trained in a high-fidelity LiDAR simulation with domain randomization and eight terrain families to enable zero-shot sim-to-real transfer. A full-stack pipeline—from LiDAR sensing and voxel processing to perception and control—enables a single policy to handle ground-level obstacles, lateral clutter, overhead constraints, and multi-level structures, achieving near-100% success in challenging tasks like stair climbing and elevated-platform stepping. Real-world deployments on a Unitree G1 demonstrate robust performance across diverse terrains without terrain-specific tuning, and ablations underscore the importance of dynamic LiDAR data, the z-grouped 2D CNN, and LiDAR-domain randomization for robust sim-to-real generalization. The work highlights a practical route to full-space perceptive locomotion by coupling lightweight perception with end-to-end optimization and a realistic sensor pipeline.

Abstract

Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage allows a single policy to go beyond the limitations of previous methods, which were confined to ground-level obstacles, and to handle lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant is also the first to achieve near-100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms, through improved end-to-end optimization.
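To make the representation concrete, the following is a minimal sketch of binning robot-frame LiDAR points into a binary occupancy grid. It is an illustration only, not the paper's implementation: the function name `voxelize`, the grid half-extents, and the 0.2 m voxel size are all assumptions chosen for the example.

```python
import math


def voxelize(points, half_extent=(0.8, 0.8, 1.0), voxel_size=0.2):
    """Bin robot-frame (x, y, z) points into a binary grid indexed [z][y][x].

    half_extent and voxel_size are illustrative values (metres), not the
    paper's settings.
    """
    dims = tuple(round(2 * h / voxel_size) for h in half_extent)
    nx, ny, nz = dims
    grid = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for point in points:
        # Shift each coordinate so the grid origin sits at the minimum corner.
        idx = [math.floor((c + h) / voxel_size) for c, h in zip(point, half_extent)]
        # Points outside the robot-centric volume are dropped.
        if all(0 <= i < n for i, n in zip(idx, dims)):
            ix, iy, iz = idx
            grid[iz][iy][ix] = 1
    return grid


# A floor return and an overhead return above the same (x, y) location,
# plus one point far outside the grid.
grid = voxelize([(0.1, 0.1, -0.9), (0.1, 0.1, 0.7), (5.0, 0.0, 0.0)])
occupied_layers = [z for z, layer in enumerate(grid) if layer[4][4]]
```

Here `occupied_layers` holds two z indices for a single (x, y) column. An elevation map stores one height per column and would have to discard one of the two returns, which is the multi-layer structure the abstract refers to.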

Paper Structure

This paper contains 31 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview. Gallant enables a single policy with voxel grids to traverse diverse 3D constrained terrains: (a) ascend and descend stairs, (b) pass doors and duck under ceilings, (c) step onto platforms and over gaps, and (d) cross stepping-stone pillars.
  • Figure 2: Method Overview. (a) Curriculum-based training over 8 representative terrains enhances generalization. (b) Realistic voxel path alignment achieved via efficient LiDAR simulation with domain-randomized latency and noise. (c) A 2D CNN-based perceptual module processes the voxel grid using the z-dimension as input channels, balancing efficiency and representational capacity. (d) A latent-aware PPO policy enables zero-shot sim-to-real transfer across diverse obstacles, including ground, lateral, and overhead challenges.
  • Figure 3: Terrain types used to train robots in simulation ($\mathbf{p}_\tau^{\max}$)
  • Figure 4: Humanoid robot traverses diverse 3D constrained terrains in both simulation and the real world. (a) Traversal across the eight simulated training terrain types. (b) Ducking under suspended ceiling obstacles. (c) Local navigation through lateral clutter. (d) Stepping onto a 30 cm-high platform and crossing a 40 cm gap. (e) Traversing pile-like stepping-stone terrain. (f)(g) Ascending and descending 20 cm stairs. All deployments are based on the same policy.
  • Figure 5: Visualization of simulation ablation analyses. (a) The humanoid crouches to traverse under a low ceiling; (b) Voxel grid from LiDAR simulation that includes dynamic objects captures the robot’s own links; (c) LiDAR simulation restricted to static objects excludes robot links from the voxel grid; (d) Mean training iteration time for Gallant with different CNN-based perception modules.
  • ...and 4 more figures
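
The channel treatment named in the Figure 2(c) caption, where the z-dimension of the voxel grid serves as the input channels of a 2D convolution, can be sketched as follows. This is a hedged, single-output-channel illustration in plain Python. The function name `conv2d_z_channels` and the toy weights are hypothetical, and the sketch does not reproduce the paper's network or its z-grouping scheme.

```python
def conv2d_z_channels(grid, kernels, bias=0.0):
    """Valid 2D cross-correlation over (y, x) with z slices as input channels.

    grid:    [nz][ny][nx] occupancy values
    kernels: [nz][kh][kw] weights for one output channel (hypothetical values)
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out = []
    for y in range(ny - kh + 1):
        row = []
        for x in range(nx - kw + 1):
            acc = bias
            # Sum over every height layer: the kernel slides only in the
            # horizontal plane, while z is mixed like image colour channels.
            for z in range(nz):
                for dy in range(kh):
                    for dx in range(kw):
                        acc += kernels[z][dy][dx] * grid[z][y + dy][x + dx]
            row.append(acc)
        out.append(row)
    return out


# Two height layers on a 3x3 footprint: a fully occupied lower layer and a
# single occupied voxel in the upper layer.
lower = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
upper = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
weights = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
features = conv2d_z_channels([lower, upper], weights)
```

Because the kernel slides only over the horizontal plane, the cost scales with the footprint rather than with a full 3D sweep, which is one plausible reading of the efficiency trade-off the caption mentions.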