Integrating Model-Based Footstep Planning with Model-Free Reinforcement Learning for Dynamic Legged Locomotion

Ho Jae Lee, Seungwoo Hong, Sangbae Kim

TL;DR

The paper tackles robust dynamic legged locomotion by fusing a physics-based step planner grounded in the 3D-LIPM with a model-free PPO policy. The planner generates target foot placements via ICP trajectories from velocity commands, while the RL policy learns to track these placements and maintain balance, enabling exploration beyond the simplified model. On the MIT Humanoid, the approach achieves stable forward walking up to 1.5 m/s and performs dynamic turning, with demonstrated generalization to unseen rough and gap terrains and successful sim-to-real transfer. This method offers improved velocity tracking and adaptability by leveraging physics-informed guidance without overfitting to a template model, indicating practical potential for real-world legged locomotion.

Abstract

In this work, we introduce a control framework that combines model-based footstep planning with Reinforcement Learning (RL), leveraging desired footstep patterns derived from the Linear Inverted Pendulum (LIP) dynamics. Utilizing the LIP model, our method forward predicts robot states and determines the desired foot placement given the velocity commands. We then train an RL policy to track the foot placements without following the full reference motions derived from the LIP model. This partial guidance from the physics model allows the RL policy to integrate the predictive capabilities of the physics-informed dynamics and the adaptability characteristics of the RL controller without overfitting the policy to the template model. Our approach is validated on the MIT Humanoid, demonstrating that our policy can achieve stable yet dynamic locomotion for walking and turning. We further validate the adaptability and generalizability of our policy by extending the locomotion task to unseen, uneven terrain. During the hardware deployment, we have achieved forward walking speeds of up to 1.5 m/s on a treadmill and have successfully performed dynamic locomotion maneuvers such as 90-degree and 180-degree turns.
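The forward prediction and foot placement described in the abstract can be illustrated with the textbook closed-form LIP solution. The following is a minimal sketch for a single horizontal axis, not the authors' implementation: the function names, the CoM height of 0.6 m, and the offset value are assumptions made for illustration, and the offset is passed in as a plain number rather than computed from the velocity command. ICP is taken here to mean the instantaneous capture point, $\xi = x + \dot{x}/\omega$ with $\omega = \sqrt{g/z_0}$.

```python
import math

GRAVITY = 9.81  # m/s^2


def lip_predict(x0, v0, p, height, t):
    """Closed-form LIP state after t seconds along one horizontal axis.

    x0, v0: current CoM position and velocity
    p:      stance foot position (pendulum pivot)
    height: constant CoM height z0 (assumed value, not from the paper)
    """
    omega = math.sqrt(GRAVITY / height)
    c, s = math.cosh(omega * t), math.sinh(omega * t)
    x = p + (x0 - p) * c + (v0 / omega) * s
    v = (x0 - p) * omega * s + v0 * c
    return x, v


def icp(x, v, height):
    """Instantaneous capture point for a CoM state (x, v)."""
    omega = math.sqrt(GRAVITY / height)
    return x + v / omega


def desired_step(x0, v0, p, height, t_remaining, offset):
    """Predict the end-of-step ICP and add an offset to get a step target.

    offset is a hypothetical stand-in for the paper's (b_x, b_y) terms.
    """
    x_f, v_f = lip_predict(x0, v0, p, height, t_remaining)
    return icp(x_f, v_f, height) + offset


if __name__ == "__main__":
    # Illustrative numbers only: 0.35 s remaining in the step, 0.6 m CoM height.
    target = desired_step(x0=0.02, v0=0.5, p=0.0, height=0.6,
                          t_remaining=0.35, offset=0.05)
    print(f"desired step location along x: {target:.3f} m")
```

In this sketch the policy would receive only the resulting step target, not the predicted CoM trajectory, which mirrors the partial guidance the abstract describes.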

Paper Structure (11 sections, 18 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Our control hierarchy, which employs a 3D-LIPM to determine the desired footstep location for locomotion. We train an RL policy to track the given steps and deploy the policy on the MIT Humanoid.
  • Figure 2: Step pattern generation algorithms for the 3D-LIPM, shown from a 3D perspective and a 2D top-view perspective. The 3D view depicts the LIPM with two legs. The LIP dynamics can predict the CoM trajectory (green lines and green dashed lines). Our method calculates the ICP trajectory (yellow lines) and adds offsets $(b_x, b_y)$ to the final ICP $(\xi_x^{\text{f}}, \xi_y^{\text{f}})$ to determine desired step locations for tracking velocity commands. The 2D view depicts the proposed method from above.
  • Figure 3: Overall control diagram and training framework for both learning in simulation and deployment to hardware. The step pattern generation algorithms compute the desired step location from the robot's CoM position, velocity, and foot states. These algorithms and the neural network (NN) update at 100 Hz, and both the actor and the critic are trained using the PPO algorithm. Once the policy (actor) outputs joint position targets, the joint PD controller is evaluated at 1 kHz, and the commanded torques are sent to the motors.
  • Figure 4: Comparison of velocity tracking performance among our method, End-to-End policies trained on flat terrain versus mixed terrains (flat, rough, and gap), and a Raibert heuristic policy. Commands were given on flat terrain. Our method exceeds the performance of the End-to-End approach trained on varied terrains and shows comparable results to the End-to-End policy trained exclusively on flat terrain.
  • Figure 5: Learned foot contact schedule. The step duration $T_{\text{s}}$, set to 0.35 seconds, was encouraged by a contact schedule reward. The error between the measured and desired foot contact schedule is less than 0.01 seconds.
  • ...and 5 more figures
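
The two update rates named in the Figure 3 caption (policy at 100 Hz, joint PD controller at 1 kHz) imply ten PD evaluations per policy output. The sketch below shows that multi-rate structure for a single joint. It is only an illustration under stated assumptions: the gains, the inertia value, the toy integrator, and the `policy` callable are all hypothetical and are not taken from the paper or from the robot's software.

```python
POLICY_HZ = 100   # policy / step-planner update rate from the Figure 3 caption
PD_HZ = 1000      # joint PD controller rate from the Figure 3 caption
DECIMATION = PD_HZ // POLICY_HZ  # PD ticks per policy output


def pd_torque(q_target, q, dq, kp, kd):
    """Joint-space PD law tracking a position target with zero target velocity."""
    return kp * (q_target - q) - kd * dq


def run_policy_step(policy, q, dq, kp=40.0, kd=1.0, inertia=0.1):
    """Query the policy once, then hold its target over DECIMATION PD ticks.

    The joint is modeled as a pure inertia integrated with semi-implicit
    Euler; this toy model and the default gains are assumptions.
    """
    q_target = policy(q, dq)   # one 100 Hz policy evaluation
    dt = 1.0 / PD_HZ
    torques = []
    for _ in range(DECIMATION):  # ten 1 kHz PD evaluations
        tau = pd_torque(q_target, q, dq, kp, kd)
        dq += (tau / inertia) * dt
        q += dq * dt
        torques.append(tau)
    return q, dq, torques


if __name__ == "__main__":
    # Hypothetical constant policy that always asks for 0.1 rad.
    q, dq, torques = run_policy_step(lambda q, dq: 0.1, q=0.0, dq=0.0)
    print(f"{len(torques)} PD ticks, joint angle now {q:.4f} rad")
```

Holding the position target fixed between policy updates keeps the learned policy at a low rate while the faster PD loop handles torque-level tracking.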