DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction
Jingkai Sun, Gang Han, Pihai Sun, Wen Zhao, Jiahang Cao, Jiaxu Wang, Yijie Guo, Qiang Zhang
TL;DR
This work tackles terrain-aware humanoid locomotion using a single depth camera. It combines three components: a terrain-aware policy built on a blind backbone, a cross-attention transformer that reconstructs local terrain from partial depth and proprioception, and a realistic depth synthesis pipeline that narrows the sim-to-real gap. The components are trained with teacher-student distillation and end-to-end fine-tuning, enabling robust locomotion without reliance on global localization. Empirical results on a full-sized humanoid demonstrate strong terrain generalization, reduced perception delay, and improved stability across slopes, stairs, gaps, and movable platforms, with measurable gains in reconstruction accuracy and fewer stumbles. The approach advances practical perceptive locomotion by tightly coupling structured terrain reasoning with reinforcement learning and domain-randomized depth synthesis, yielding efficient training and robust real-world performance. These findings support the feasibility of depth-only perceptive locomotion in unstructured environments and offer a scalable path toward real-time, robust humanoid control.
Abstract
Recent advances in perceptive locomotion for legged robots have been promising. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth image-based end-to-end learning and elevation map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation map-based perception to guide reinforcement learning with minimal visual input; (2) Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; (3) Realistic Depth Images Synthetic Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, achieving a reduction of over 30% in terrain reconstruction error. This combination enables efficient policy training with limited data and hardware resources, while preserving the terrain features essential for generalization. We validate our framework on a full-sized humanoid robot, demonstrating agile and adaptive locomotion across diverse and challenging terrains.
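To make the cross-attention idea in component (2) concrete, the following is a minimal NumPy sketch of single-head cross-attention in which one set of tokens queries another. It is an illustration under stated assumptions, not the architecture used in this work: the choice of proprioception tokens as queries and depth tokens as context, the token counts, the embedding width, and all names (`cross_attention`, `depth_tokens`, `proprio_tokens`) are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends over all context tokens."""
    q = queries @ w_q                          # (n_queries, d)
    k = context @ w_k                          # (n_context, d)
    v = context @ w_v                          # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarity
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model = 16
n_depth_tokens = 24    # e.g. embedded patches of a depth image (assumption)
n_query_tokens = 4     # e.g. embedded proprioceptive history (assumption)

depth_tokens = rng.normal(size=(n_depth_tokens, d_model))
proprio_tokens = rng.normal(size=(n_query_tokens, d_model))

# Random projections stand in for learned parameters.
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

fused, attn = cross_attention(proprio_tokens, depth_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In a full model, the fused tokens would typically pass through further layers and a decoder head that outputs the terrain representation; those parts are omitted here because the abstract does not specify them.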
