HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba

Yinuo Wang, Yuanyang Qi, Jinzhao Zhou, Pengxiang Meng, Xiaowen Tao

TL;DR

HuMam addresses the challenge of stable, efficient end-to-end reinforcement learning for humanoid locomotion by introducing a lightweight, state-centric fusion backbone based on a single-layer Mamba encoder. The method fuses robot-centric states with externally planned footsteps and a gait-phase clock, producing compact embeddings for a PPO-trained policy that outputs joint-position targets tracked by a low-gain PD controller. A six-term reward balances contact quality, swing smoothness, foot placement, posture, height, and upper-body stability to yield energy-efficient, stable gaits. Across forward, backward, lateral, and curved walking and standing tasks on the JVRC-1 in mc-mujoco, HuMam learns faster, is more stable, and consumes less torque and energy than a feedforward baseline. These simulation results support Mamba as an effective backbone for compact, end-to-end humanoid control, with potential relevance to real-world deployment.

Abstract

End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.
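
The two low-level ingredients named above, a continuous phase clock and a PD loop that tracks joint-position targets, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the sine/cosine clock encoding, the zero target velocity in the PD law, and the function names `phase_clock` and `pd_torques` are all hypothetical choices made for the example, and the gains are placeholder values.

```python
import math


def phase_clock(step, period):
    """Encode the gait phase as a point on the unit circle.

    Assumption: a (sin, cos) pair of the phase angle; the summary only
    says the clock is continuous, not how it is encoded.
    """
    phi = 2.0 * math.pi * (step % period) / period
    return math.sin(phi), math.cos(phi)


def pd_torques(q_target, q, qd, kp, kd):
    """Convert joint-position targets into joint torques with a PD law.

    Assumption: the target joint velocity is zero, so the damping term
    acts on the measured velocity alone.
    """
    return [
        kp_i * (qt - qi) - kd_i * qdi
        for qt, qi, qdi, kp_i, kd_i in zip(q_target, q, qd, kp, kd)
    ]


# Hypothetical single-joint example with placeholder gains.
clock = phase_clock(step=10, period=40)  # a quarter of the way through the cycle
tau = pd_torques(q_target=[0.5], q=[0.2], qd=[1.0], kp=[10.0], kd=[2.0])
```

In a setup like this the policy would be queried at a lower rate than the PD loop, and low gains leave the policy room to shape compliant motion, but the actual rates and gains are not given in this summary.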

Paper Structure

This paper contains 24 sections, 23 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overall architecture of the proposed humanoid locomotion framework. At each time step, robot-centric and external states are collected as observations and projected into a latent embedding. A single-layer Mamba encoder processes these features to produce compact representations for the policy and value heads, which are optimized using PPO. A hierarchical control structure is adopted, where the high-level policy outputs desired joint positions and a low-gain PD controller converts them into executable joint torques. The reward design combines foot-level and body-level objectives to encourage stable and natural gaits.
  • Figure 2: Simulated environments in which the robot is trained and evaluated. Panels (a)–(e): (a) walking straight forward; (b) walking straight backward; (c) walking on a curved path; (d) standing in place; (e) lateral walking.
  • Figure 3: Training curves of HuMam and the baseline across scenarios. Solid lines denote the mean episode return across seeds, and shaded regions indicate the standard deviation.
  • Figure 4: Foot trajectory of lateral walking.
  • Figure 5: Foot trajectory of backward walking.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Remark 1