Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

Wandong Sun; Yongbo Su; Leoric Huang; Alex Zhang; Dwyane Wei; Mu San; Daniel Tian; Ellie Cao; Finn Yan; Ethan Xie; Zongwu Xie

TL;DR

This work addresses two core challenges of vision-based humanoid locomotion: perception noise introduced by the sim-to-real gap, and the difficulty of learning a single policy over diverse terrains. It introduces a two-stage framework: (i) privileged reinforcement learning on height scans, with terrain-specific rewards and a multi-critic/multi-discriminator setup, and (ii) vision-aware distillation, which transfers the privileged policy to a depth-based deployment policy trained with a comprehensive depth augmentation pipeline. The key contributions are a realistic depth sensor simulation that reproduces stereo artifacts and calibration variability, terrain-aware learning signals with motion priors, and a distillation mechanism that combines denoising with latent regularization for robust transfer to real depth inputs. Experiments on two humanoid platforms show strong sim-to-real performance on both extreme and fine-grained tasks, and real-world deployment achieves high success rates with low power degradation, indicating practical viability for robust visual locomotion in complex environments.

Abstract

Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific reward shaping integrated with multi-critic and multi-discriminator learning, where dedicated networks capture the distinct dynamics and motion priors of each terrain type. We validate our approach on two humanoid platforms equipped with different stereo depth cameras. The resulting policy demonstrates robust performance across diverse environments, seamlessly handling extreme challenges such as high platforms and wide gaps, as well as fine-grained tasks including bidirectional long-term staircase traversal.
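To make the distillation objective described above more concrete, the following is a minimal sketch of how a behavior-cloning term, a latent alignment term, and a denoising auxiliary term might be combined into one loss. It is an illustration under stated assumptions, not the authors' implementation: the function name `distillation_loss`, the use of mean squared error for every term, the choice of a clean depth image as the denoising target, and the weights `w_latent` and `w_denoise` are all hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def distillation_loss(student_action, teacher_action,
                      student_latent, teacher_latent,
                      denoised_depth, clean_depth,
                      w_latent=0.5, w_denoise=0.5):
    """Combine three terms (weights are hypothetical, not from the paper).

    - behavior cloning: student actions should match teacher actions
    - latent alignment: the depth encoder's latent should match the
      latent the teacher computes from privileged height scans
    - denoising: an auxiliary head should recover a clean target from
      the augmented depth input (assumed here to be a clean depth image)
    """
    terms = {
        "bc": mse(student_action, teacher_action),
        "latent": mse(student_latent, teacher_latent),
        "denoise": mse(denoised_depth, clean_depth),
    }
    total = terms["bc"] + w_latent * terms["latent"] + w_denoise * terms["denoise"]
    return total, terms


if __name__ == "__main__":
    total, terms = distillation_loss(
        student_action=np.zeros(12), teacher_action=np.ones(12),
        student_latent=np.ones(32), teacher_latent=np.ones(32),
        denoised_depth=np.zeros((4, 4)), clean_depth=2.0 * np.ones((4, 4)),
    )
    print(total, terms)
```

In this sketch the teacher's outputs are treated as fixed targets, so only the student's encoder, policy head, and denoising head would receive gradients in a real training loop.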

Paper Structure (50 sections, 15 equations, 7 figures, 16 tables)

Introduction
Related Work
Perceptive Legged Locomotion
Data Augmentation in Visual Reinforcement Learning
Method
Realistic Depth Sensor Simulation
Stereo Depth Fusion
Depth-Dependent Noise
Structured Noise Patterns
Optical Distortions
Calibration Uncertainties
Preprocessing
Privileged Reinforcement Learning
Height Scan Observations
Terrain-Specific Reward Shaping
...and 35 more sections

Figures (7)

Figure 1: Diverse terrain types used during training. Each terrain type contains 20 difficulty levels for curriculum learning.
Figure 2: Visualization of the depth augmentation pipeline. Starting from clean left and right depth images, the pipeline sequentially applies: (1) stereo fusion, (2) random convolution, (3) Gaussian noise, (4) Perlin noise, (5) scale randomization, (6) zero pixel failures, (7) max pixel failures, (8) depth clipping and spatial cropping to produce realistic depth observations for sim-to-real transfer.
Figure 3: Method Overview. Our framework consists of two stages: (1) Privileged RL Training: A teacher policy is trained with height scan observations using multi-critic and multi-discriminator learning, where terrain-specific reward shaping and dedicated value networks handle diverse terrain categories (stairs/platforms, gaps, rough terrain). (2) Vision-Aware Distillation: The privileged policy is distilled into a deployment policy operating on augmented depth images, combining behavior cloning with denoising objectives for robust sim-to-real transfer.
Figure 4: Real-world deployment sequences demonstrating stair traversal. Top row: ascending stairs with anticipatory leg lifting. Bottom row: descending stairs with controlled foot placement. The policy executes smooth gait patterns without any real-world fine-tuning.
Figure 5: t-SNE visualization of the depth encoder's latent space across six terrain types. Each terrain forms a distinct cluster, demonstrating effective terrain-specific representation learning despite realistic sensor noise.
...and 2 more figures
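
The eight augmentation steps listed in the Figure 2 caption can be sketched in NumPy as follows. This is a rough illustration under several assumptions, not the paper's pipeline: stereo fusion is approximated by averaging the two views and dropping pixels where they disagree, a bilinearly upsampled random grid (value noise) stands in for Perlin noise, and every numeric parameter (noise scales, failure probabilities, `max_depth`, `crop`) is a hypothetical placeholder.

```python
import numpy as np


def random_convolution(depth, rng):
    """Filter the image with a random 3x3 kernel normalized to sum to 1."""
    depth = np.asarray(depth, dtype=float)
    kernel = rng.uniform(0.0, 1.0, size=(3, 3))
    kernel /= kernel.sum()
    h, w = depth.shape
    padded = np.pad(depth, 1, mode="edge")
    out = np.zeros_like(depth)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def smooth_noise(shape, cells, rng):
    """Low-frequency value noise in [-1, 1]: a coarse random grid,
    bilinearly upsampled. A simple stand-in for Perlin noise."""
    h, w = shape
    grid = rng.uniform(-1.0, 1.0, size=(cells + 1, cells + 1))
    ys = np.linspace(0.0, cells, h, endpoint=False)
    xs = np.linspace(0.0, cells, w, endpoint=False)
    y0, x0 = ys.astype(int), xs.astype(int)
    fy, fx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = grid[y0][:, x0] * (1 - fx) + grid[y0][:, x0 + 1] * fx
    bottom = grid[y0 + 1][:, x0] * (1 - fx) + grid[y0 + 1][:, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy


def augment_depth(left, right, rng, max_depth=3.0, crop=4):
    """Apply the eight steps in order; all constants are hypothetical."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # (1) Stereo fusion (assumed): average the views and remember pixels
    #     where they disagree, which are treated as dropouts in step 6.
    fused = 0.5 * (left + right)
    invalid = np.abs(left - right) > 0.1
    # (2) Random convolution.
    depth = random_convolution(fused, rng)
    # (3) Gaussian noise whose standard deviation grows with distance.
    depth = depth + rng.normal(0.0, 1.0, depth.shape) * (0.01 * depth ** 2)
    # (4) Structured low-frequency noise (value noise instead of Perlin).
    depth = depth + 0.02 * smooth_noise(depth.shape, cells=4, rng=rng)
    # (5) Scale randomization: one global multiplicative factor.
    depth = depth * rng.uniform(0.95, 1.05)
    # (6) Zero pixel failures: random dropouts plus stereo mismatches.
    depth[(rng.random(depth.shape) < 0.01) | invalid] = 0.0
    # (7) Max pixel failures: random pixels saturate at the far limit.
    depth[rng.random(depth.shape) < 0.01] = max_depth
    # (8) Depth clipping and spatial cropping.
    depth = np.clip(depth, 0.0, max_depth)
    h, w = depth.shape
    return depth[crop:h - crop, crop:w - crop]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = np.full((32, 48), 1.5)
    right = left.copy()
    right[:2, :2] = 2.5  # simulate a stereo mismatch in one corner
    print(augment_depth(left, right, rng).shape)
```

Passing an explicit `rng` keeps the augmentation reproducible per environment, which is convenient when comparing a clean and an augmented rendering of the same scene.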
