CReF: Cross-modal and Recurrent Fusion for Depth-conditioned Humanoid Locomotion

Yuan Hao, Ruiqi Yu, Shixin Luo, Guoteng Zhang, Jun Wu, Qiuguo Zhu

Abstract

Stable traversal over geometrically complex terrain increasingly requires exteroceptive perception, yet prior perceptive humanoid locomotion methods often remain tied to explicit geometric abstractions, either by mediating control through robot-centric 2.5D terrain representations or by shaping depth learning with auxiliary geometry-related targets. Such designs inherit the representational bias of the intermediate or supervisory target and can be restrictive for vertical structures, perforated obstacles, and complex real-world clutter. We propose CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth without explicit geometric intermediates. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the resulting representation with a gated residual fusion block, and performs temporal integration with a Gated Recurrent Unit (GRU) regulated by a highway-style output gate for state-dependent blending of recurrent and feedforward features. To further improve terrain interaction, we introduce a terrain-aware foothold placement reward that extracts supportable foothold candidates from foot-end point-cloud samples and rewards touchdown locations that lie close to the nearest supportable candidate. Experiments in simulation and on a physical humanoid demonstrate robust traversal over diverse terrains and effective zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.
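The abstract names a concrete fusion pipeline: proprioception-queried cross-modal attention over depth tokens, a gated residual fusion block, a GRU for temporal integration, and a highway-style output gate that blends recurrent and feedforward features. A minimal NumPy sketch of one control step is given below, under assumed dimensions; all parameter names (`Wq`, `Wg`, `Wt`, etc.) and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cref_fusion_step(proprio, depth_tokens, h_prev, p):
    """One control step of the fusion pipeline described in the abstract:
    cross-modal attention, gated residual fusion, a GRU cell, and a
    highway-style output gate. `p` is a dict of (hypothetical) weights."""
    d = h_prev.size

    # Cross-modal attention: proprioception is the query; the depth
    # tokens supply keys and values.
    q = proprio @ p["Wq"]                       # (d,)
    k = depth_tokens @ p["Wk"]                  # (T, d)
    v = depth_tokens @ p["Wv"]                  # (T, d)
    attn = softmax(k @ q / np.sqrt(d))          # (T,)
    ctx = attn @ v                              # (d,)

    # Gated residual fusion: a learned gate decides how much of the
    # visual context to blend into the proprioceptive feature.
    x = proprio @ p["Wp"]                       # (d,)
    g = sigmoid(np.concatenate([x, ctx]) @ p["Wg"])  # (d,)
    fused = x + g * ctx

    # GRU cell for temporal integration of the fused feature.
    z = sigmoid(p["Wz"] @ fused + p["Uz"] @ h_prev)
    r = sigmoid(p["Wr"] @ fused + p["Ur"] @ h_prev)
    h_tilde = np.tanh(p["Wh"] @ fused + p["Uh"] @ (r * h_prev))
    h = (1.0 - z) * h_prev + z * h_tilde

    # Highway-style output gate: state-dependent blend of the recurrent
    # state with the feedforward (non-recurrent) fused feature.
    t = sigmoid(p["Wt"] @ fused)
    out = t * h + (1.0 - t) * fused
    return out, h
```

The design choice the abstract emphasizes is that depth enters the policy only through this attention-and-gating path, so the network, rather than an explicit 2.5D terrain map, decides which depth evidence reaches the controller at each step.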

Paper Structure

This paper contains 21 sections, 23 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 3: Our CReF framework enables robust real-world humanoid locomotion, including more than 20 consecutive stair traversals, a 40 cm high platform, an 80 cm gap, and real-world stairs with a 20 cm rise and 26 cm tread, while generalizing robustly beyond the training terrains.
  • Figure 4: Overview of CReF. The proposed single-stage depth-conditioned policy combines cross-modal attention, gated residual fusion, recurrent fusion, and a terrain-aware foothold placement reward for robust terrain locomotion.
  • Figure 5: Simulation terrains used in the experiments. The upper panel shows representative terrains used during training. The lower panel shows additional MuJoCo out-of-distribution terrains used for cross-simulator evaluation.
  • Figure 6: Stair foothold distribution comparison between the proposed foothold placement reward and the FCQR baseline. The proposed reward produces more concentrated and repeatable touchdown distributions in both ascent and descent, reduces touchdown deviation within the stair tread, and eliminates the logged ankle-riser collisions in ascending rollouts.
  • Figure 7: Representative real-world rollouts of CReF across multiple terrains and deployment scenes. The figure includes stair traversal with side railings, entrance-step and platform-like transitions, outdoor pathways, and other real-world terrain configurations. Red boxes highlight examples where environmental factors introduce out-of-distribution depth observations, such as large invalid regions in the sensed depth image.
  • ...and 1 more figure