DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction

Jingkai Sun, Gang Han, Pihai Sun, Wen Zhao, Jiahang Cao, Jiaxu Wang, Yijie Guo, Qiang Zhang

TL;DR

This work tackles terrain-aware humanoid locomotion with a single depth camera by combining three components: a terrain-aware policy with a blind backbone, a cross-attention transformer that reconstructs local terrain from partial depth and proprioception, and a realistic depth synthesis pipeline that narrows the sim-to-real gap. The components are trained via teacher–student distillation and end-to-end fine-tuning, enabling robust locomotion without reliance on global localization. Experiments on a full-sized humanoid demonstrate strong terrain generalization, reduced perception delay, and improved stability across slopes, stairs, gaps, and movable platforms, with gains in reconstruction accuracy and fewer stumbles. The approach couples structured terrain reasoning with reinforcement learning and domain-randomized depth synthesis, yielding efficient training and robust real-world performance. These findings support the feasibility of depth-only perceptive locomotion in unstructured environments and offer a scalable path toward real-time, robust humanoid control.

Abstract

Recent advancements in legged robot perceptive locomotion have shown promising progress. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth image-based end-to-end learning and elevation map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation map-based perception to guide reinforcement learning with minimal visual input; (2) Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; (3) Realistic Depth Images Synthetic Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, achieving over 30% reduction in terrain reconstruction error. This combination enables efficient policy training with limited data and hardware resources, while preserving critical terrain features essential for generalization. We validate our framework on a full-sized humanoid robot, demonstrating agile and adaptive locomotion across diverse and challenging terrains.
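
To make the cross-attention idea in component (2) concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is an illustration only, not the paper's architecture: the choice of proprioceptive tokens as queries and depth-patch tokens as keys and values, the token dimensions, and the names `cross_attention`, `proprio_tokens`, `depth_tokens`, and `terrain_values` are all assumptions, and the learned projection matrices of a real Transformer layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: list of d-dim vectors (assumed here: proprioceptive tokens)
    keys:    list of d-dim vectors (assumed here: depth-patch tokens)
    values:  list of vectors, one per key (assumed here: per-patch height cues)
    Returns the attended outputs and the attention weights.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy inputs: one proprioceptive query, three depth-patch tokens.
proprio_tokens = [[1.0, 0.0]]
depth_tokens = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
terrain_values = [[0.10], [0.00], [-0.20]]

fused, attn = cross_attention(proprio_tokens, depth_tokens, terrain_values)
```

In this toy setup the query is most similar to the first depth token, so that token receives the largest attention weight and the fused output is pulled toward its value. A trained model would learn the query, key, and value projections rather than use raw tokens.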

Paper Structure

This paper contains 14 sections, 25 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of the proposed teacher–student distillation framework for humanoid perceptive locomotion. (A) The student policy interacts with the environment to generate actions, while a teacher policy provides supervision via an $\mathcal{L}_2$ loss. The distillation process transfers locomotion skills across diverse terrains. (B) A reconstruction module integrates proprioceptive history and depth information through a Transformer-based pipeline to refine terrain representations $\hat{H}^{\text{refine}}_t$, supervised by ground truth. The refined features are fed into the student policy, which is further optimized using distillation and adversarial reinforcement learning.
  • Figure 2: Ablation study of the proposed framework across four challenging terrains: Stair Up, Gap, Stair Down, and Hurdle. Success rate (bars) and traversing rate (lines) are reported for our full model and three ablated variants: without multi-teacher distillation, without blind backbone (w/o Backbone), and without gait phase and command adaptation in action (w/o Adaptation). The x-axis shows terrain difficulty. Removing any key component significantly degrades performance, particularly on complex terrains such as gaps and hurdles, highlighting the effectiveness of the complete design.
  • Figure 3: The figure illustrates our physically grounded noise pipeline applied to synthetic depth images. From left to right: an idealized rendered image, occlusion added due to embodiment and camera angle, noise injection with boundary cropping, and a real-world depth image for comparison. The pipeline reproduces key visual artifacts observed in real sensors, including occlusion shadows, dropout, and structured noise, facilitating realistic sim-to-real transfer.
  • Figure 4: Visual comparison of the terrain reconstructed from the depth input (blue) and the ground truth (red-to-blue). The top row shows the reconstructed terrain and the ground truth built from elevation maps. The bottom row shows the corresponding raw depth images.
  • Figure 5: Comparison between our reconstruction method and the elevation map in a gap terrain scenario. The top figure shows a 3D reconstruction of the terrain: our method (blue) reconstructs the full geometry of the gap, including its bottom, while the elevation map (red-to-blue) fails to capture the occluded region due to missing depth information. The bottom figure shows the raw depth image, where the gap bottom is entirely occluded. Our method infers and reconstructs this missing geometry, enabling robust locomotion planning across such challenging terrains.
  • ...and 3 more figures
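
The stages named in the Figure 3 caption (ideal render, occlusion, noise injection, boundary cropping) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: a precomputed boolean mask stands in for self-occlusion-aware ray casting, the noise is Gaussian with a standard deviation proportional to depth, dropout is uniform random, and the function name `synthesize_depth` and its parameters `sigma`, `dropout`, and `crop` are hypothetical.

```python
import random


def synthesize_depth(ideal, body_mask, rng, sigma=0.01, dropout=0.05, crop=1):
    """Degrade an ideal depth image so it looks more like a real sensor frame.

    ideal:     2D list of depths in metres from an idealized render
    body_mask: 2D list of bools, True where the robot's own body hides the scene
               (assumed to be precomputed; stands in for ray casting)
    rng:       random.Random instance, so results are reproducible
    Invalid pixels (occluded or dropped) are encoded as 0.0.
    """
    rows, cols = len(ideal), len(ideal[0])
    out = []
    # Boundary cropping: skip `crop` pixels on every side of the image.
    for r in range(crop, rows - crop):
        row = []
        for c in range(crop, cols - crop):
            if body_mask[r][c] or rng.random() < dropout:
                # Occlusion shadow or random sensor dropout.
                row.append(0.0)
            else:
                # Depth-proportional Gaussian noise (an assumed noise model).
                d = ideal[r][c]
                row.append(max(0.0, d + rng.gauss(0.0, sigma * d)))
        out.append(row)
    return out


# Hypothetical toy frame: a flat wall 2 m away, with the robot body
# covering the bottom-left corner of the image.
ideal = [[2.0] * 8 for _ in range(6)]
body_mask = [[r >= 4 and c < 3 for c in range(8)] for r in range(6)]
noisy = synthesize_depth(ideal, body_mask, random.Random(0))
```

Passing a seeded `random.Random` keeps each synthetic frame reproducible, which helps when comparing reconstruction error with and without a given noise stage. Randomizing `sigma`, `dropout`, and `crop` per episode would be one way to apply domain randomization on top of this sketch.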