Learning Generic and Dynamic Locomotion of Humanoids Across Discrete Terrains

Shangqun Yu; Nisal Perera; Daniel Marew; Donghyun Kim

TL;DR

The paper tackles terrain-adaptive dynamic humanoid locomotion by marrying a reinforcement-learned Basal Ganglia policy (BG-policy) with a state-of-the-art optimization-based motion controller (MPC+WBIC). Trained in a simplified 2D environment, the BG-policy makes high-level decisions (gait type, step locations, forward speed) that are resolved into feasible trajectories by convex optimization and executed by a real-time WBIC controller, enabling walking, jumping, and leaping across discrete terrains with significantly fewer samples (~1e6) and zero-shot transfer to different humanoid platforms. Key contributions include a data-efficient RL framework, a robot-agnostic control architecture, and demonstrated robustness across various robots and tasks, including omni-directional walking. Practical impact lies in enabling terrain-aware dynamic locomotion on humanoids with reduced training data and improved cross-platform applicability, potentially accelerating deployment in real-world robotic systems.

Abstract

This paper addresses the challenge of terrain-adaptive dynamic locomotion in humanoid robots, a problem traditionally tackled by optimization-based methods or reinforcement learning (RL). Optimization-based methods, such as model-predictive control, excel at finding optimal reaction forces and achieving agile locomotion, especially in quadrupeds, but struggle with the nonlinear hybrid dynamics of legged systems and the real-time computation of step location, timing, and reaction forces. Conversely, RL-based methods show promise in navigating dynamic and rough terrains but are limited by their extensive data requirements. We introduce a novel locomotion architecture that integrates a neural network policy, trained through RL in simplified environments, with a state-of-the-art motion controller combining model-predictive control (MPC) and whole-body impulse control (WBIC). The policy efficiently learns high-level locomotion strategies, such as gait selection and step positioning, without the need for full dynamics simulations. This control architecture enables humanoid robots to dynamically navigate discrete terrains, making strategic locomotion decisions (e.g., walking, jumping, and leaping) based on ground height maps. Our results demonstrate that this integrated control architecture achieves dynamic locomotion with significantly fewer training samples than conventional RL-based methods and can be transferred to different humanoid platforms without additional training. The control architecture has been extensively tested in dynamic simulations, accomplishing terrain height-based dynamic locomotion for three different robots.
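
The division of labor described above, where a high-level policy reads a ground height map and outputs a gait, a step location, and a forward speed, can be illustrated with a minimal sketch. Every name here (`Gait`, `HighLevelAction`, `choose_action`) and every numeric value is a hypothetical stand-in rather than the paper's implementation. In particular, the hand-written rule in `choose_action` only occupies the slot of the learned policy, and the MPC and WBIC stages are not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Gait(Enum):
    WALK = "walk"
    JUMP = "jump"
    LEAP = "leap"


@dataclass(frozen=True)
class HighLevelAction:
    """Hypothetical container for the high-level decision."""
    gait: Gait
    step_length: float    # metres; illustrative value only
    forward_speed: float  # metres per second; illustrative value only


def choose_action(
    heights: Sequence[float],
    gap_depth: float = -0.5,  # assumed threshold, not from the paper
    step_up: float = 0.2,     # assumed threshold, not from the paper
) -> HighLevelAction:
    """Hand-written placeholder for the learned policy.

    heights[0] is the ground height under the robot; the remaining
    entries are sampled ahead of it along the direction of travel.
    """
    if not heights:
        raise ValueError("height map must contain at least one sample")
    current, ahead = heights[0], heights[1:]
    # A deep drop ahead is treated as a gap to leap over.
    if any(h - current <= gap_depth for h in ahead):
        return HighLevelAction(Gait.LEAP, step_length=0.6, forward_speed=1.5)
    # A raised surface ahead is treated as a platform to jump onto.
    if any(h - current >= step_up for h in ahead):
        return HighLevelAction(Gait.JUMP, step_length=0.4, forward_speed=1.0)
    return HighLevelAction(Gait.WALK, step_length=0.3, forward_speed=0.8)
```

In the architecture the abstract describes, a decision of this kind would be handed to the optimization-based controller, which resolves it into feasible trajectories and joint commands; the sketch stops at the decision itself.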

Paper Structure (14 sections, 23 equations, 9 figures, 1 table, 1 algorithm)


Introduction
Related work
Contact Implicit Trajectory Optimization (CI-TO)
Training efficiency of Reinforcement Learning
Basal Ganglia Policy Training
Optimization-based Motion Controller
Model Predictive Control
Lateral Directional Step Location Selection
Whole Body Impulse Control (WBIC)
Results
Experiment and Evaluation
Benchmark
Algorithm's Robustness
Concluding Remarks

Figures (9)

Figure 1: The proposed learning framework and control architecture. (a) We first train a policy using a single rigid body dynamics (SRBD) model in the sagittal plane and trajectory optimization. (b) The trained policy (BG-policy) is integrated into the motion controller (MPC + WBIC) to compute the final joint commands for a humanoid robot. (c) Our control architecture commands a humanoid robot to walk, leap over gaps, jump onto platforms, and navigate stairs based on vision-based data.
Figure 2: State transition in the 2D environment. We designed a 2D environment that makes the policy focus only on the information that matters. The simplification leads to exceptionally efficient training of the policy. This environment enables the policy to be trained effectively using no more than 1 million samples, a quantity several orders of magnitude smaller than what is typically required by end-to-end methods with vision-based data.
Figure 3: Three different gaits. All gaits have the same length but different contact sequences.
Figure 4: Illustration of how a constant prediction horizon is maintained for the MPC. Between two actions, the BG-policy uses the prediction from the MPC to compute its output, which ensures that the MPC has a sufficiently long contact sequence to keep its prediction horizon constant.
Figure 5: Training performance and validation courses. (a) Both SAC and PPO are trained for 1 million steps with 5 different seeds. The average return plot shows the outstanding performance of the SAC algorithm. The average success rate of the policy trained by SAC is also shown as orange bars. Once training reaches 1 million steps, the policy has converged and shows robust locomotion performance. (b) To evaluate the actual performance of the trained policy, we made 30 validation courses with randomly generated obstacles in the full dynamic simulator.
...and 4 more figures
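
The captions for Figures 3 and 4 describe gaits of equal length that differ only in contact sequence, and a scheme in which new gait segments are appended early enough that the MPC always sees a full prediction horizon. A minimal sketch of that bookkeeping follows. The contact patterns, the segment length of 8, the horizon of 10, and all names (`GAITS`, `ContactSchedule`, `simulate`) are assumptions made for illustration; none of them come from the paper.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

# (left foot, right foot): 1 = in contact, 0 = in the air.
Contact = Tuple[int, int]

GAIT_LENGTH = 8  # assumed number of MPC steps per gait segment

# Hypothetical contact patterns: same length, different sequences.
GAITS: Dict[str, List[Contact]] = {
    "walk": [(1, 0)] * 4 + [(0, 1)] * 4,
    "jump": [(1, 1)] * 4 + [(0, 0)] * 4,
    "leap": [(1, 0)] * 3 + [(0, 0)] * 2 + [(0, 1)] * 3,
}
assert all(len(pattern) == GAIT_LENGTH for pattern in GAITS.values())


class ContactSchedule:
    """Queue of future contact states consumed one MPC step at a time."""

    def __init__(self, horizon: int) -> None:
        self.horizon = horizon
        self.queue: Deque[Contact] = deque()

    def needs_gait(self) -> bool:
        # True when the queue cannot fill a full prediction horizon.
        return len(self.queue) < self.horizon

    def append_gait(self, name: str) -> None:
        self.queue.extend(GAITS[name])

    def mpc_window(self) -> List[Contact]:
        if self.needs_gait():
            raise RuntimeError("contact sequence shorter than the horizon")
        return list(self.queue)[: self.horizon]

    def advance(self) -> None:
        self.queue.popleft()


def simulate(gait_choices: Iterable[str], steps: int, horizon: int = 10) -> List[int]:
    """Return the MPC window length observed at each control step."""
    schedule = ContactSchedule(horizon)
    choices = iter(gait_choices)
    window_lengths = []
    for _ in range(steps):
        # Request the next gait before the queue runs short.
        while schedule.needs_gait():
            schedule.append_gait(next(choices))
        window_lengths.append(len(schedule.mpc_window()))
        schedule.advance()
    return window_lengths
```

Calling `simulate(["walk", "jump", "leap", "walk"], steps=12)` yields a window of length 10 at every step, because a new segment is requested as soon as fewer than `horizon` contact states remain. In this sketch the gait names come from a fixed list; in the described architecture that choice would come from the BG-policy.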
