Dynamic Bipedal Maneuvers through Sim-to-Real Reinforcement Learning

Fangzhou Yu, Ryan Batke, Jeremy Dao, Jonathan Hurst, Kevin Green, Alan Fern

TL;DR

The paper targets highly dynamic, aperiodic bipedal maneuvers with reliable sim-to-real transfer by training recurrent turning policies on precomputed single rigid body model (SRBM) trajectory data. It introduces an epilogue reward that encourages a smooth transition back to nominal walking after a turn, and it uses dynamics randomization to bridge the sim-to-real gap. By comparing four reward formulations, it analyzes how reference information shapes learning and turning performance, and it demonstrates successful sim-to-real transfer of four-step, 90-degree turns on Cassie. The study highlights both the promise of the approach and its current hardware challenges, and it emphasizes that more complex dynamic routines will need scalable switching among many behavior policies.

Abstract

For legged robots to match the athletic capabilities of humans and animals, they must not only produce robust periodic walking and running, but also seamlessly switch between nominal locomotion gaits and more specialized transient maneuvers. Despite recent advancements in the control of bipedal robots, there has been little focus on producing highly dynamic behaviors. Recent work utilizing reinforcement learning to produce policies for control of legged robots has demonstrated success in producing robust walking behaviors. However, these learned policies have difficulty expressing a multitude of different behaviors on a single network. Inspired by conventional optimization-based control techniques for legged robots, this work applies a recurrent policy to execute four-step, 90-degree turns trained using reference data generated from optimized single rigid body model trajectories. We present a novel training framework using epilogue terminal rewards for learning specific behaviors from pre-computed trajectory data and demonstrate a successful transfer to hardware on the bipedal robot Cassie.

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: A Cassie robot executing a four-step 90° right turn. (Top Row) Hardware field test of the full-reference turning policy initialized from a commanded heading speed of 2.0 m/s on artificial turf. (Bottom Row) Cassie running the full-reference turning policy in simulation initialized from a target heading speed of 2.5 m/s.
  • Figure 2: Plot of the reference trajectory for a 2.5 m/s, four-step turn from the optimized single rigid-body model moving left to right. The thick line represents the center of mass path, with different colors showing the different stance phases. Thin lines show leg positions at the start and end of stance phases.
  • Figure 3: Visualization of a PPO rollout during training. After being initialized from a $\pi^{walk}$ pose, $\pi^{turn}$ is evaluated until the end of the turning maneuver. If $\pi^{turn}$ completes the turning maneuver, $\pi^{walk}$ subsequently takes over to generate the epilogue reward.
  • Figure 4: Comparison of sample efficiency for our proposed turning policies. Note that the absolute scales of the different curves are not necessarily comparable, since each reward function includes different reward components. The star symbols mark the time to convergence for each policy, defined as the first point on the learning curve that exceeds 97% of the maximum reward seen during training.
  • Figure 5: Plot of footstep touchdown locations and pelvis trajectory for the reference data and for the Full Reference and No Reference policies, for a turning maneuver executed at 2.5 m/s.
  • ...and 3 more figures
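
The convergence criterion stated in the Figure 4 caption (the first point on the learning curve that exceeds 97% of the maximum reward seen during training) can be sketched in a few lines. This is only an illustration of that definition, not the authors' code: the function name `convergence_index`, the `fraction` parameter, and the sample curve are hypothetical, and the sketch assumes the maximum reward is positive.

```python
from typing import Optional, Sequence


def convergence_index(rewards: Sequence[float], fraction: float = 0.97) -> Optional[int]:
    """Return the first index whose reward exceeds `fraction` of the curve's maximum.

    Assumes the maximum reward is positive; if it is zero or negative, no point
    can exceed the threshold and the function returns None.
    """
    if not rewards:
        return None
    threshold = fraction * max(rewards)
    for i, reward in enumerate(rewards):
        if reward > threshold:
            return i
    return None


# Hypothetical learning curve (illustrative numbers, not taken from the paper).
curve = [10.0, 40.0, 80.0, 98.0, 100.0, 99.0]
star = convergence_index(curve)
```

Here the threshold is 97% of the maximum (100.0), so `star` is the index of the first entry above it, namely the entry 98.0. Because each policy's threshold is relative to its own maximum, the resulting index can be compared across policies even though the caption notes that the absolute reward scales cannot.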