Robust and Versatile Bipedal Jumping Control through Reinforcement Learning

Zhongyu Li; Xue Bin Peng; Pieter Abbeel; Sergey Levine; Glen Berseth; Koushil Sreenath

TL;DR

This work tackles dynamic bipedal jumping with a goal-conditioned reinforcement learning framework trained in simulation with extensive dynamics randomization. A new policy architecture encodes the long-term I/O history with a 1D CNN while giving the policy direct access to the short-term history, which enables end-to-end learning and robust zero-shot transfer to a real Cassie robot. A three-stage training pipeline (single-goal imitation, then multi-goal training, then dynamics randomization) produces a versatile jumping policy capable of standing long jumps, landings on elevated platforms, and multi-axes jumps, even under perturbations. The real-world results show robustness to perturbations and emergent contact strategies without explicit contact sequencing or perception, a step toward more capable perception-free legged locomotion.
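The three training stages can be read as a schedule over what is randomized in each episode. The following is a minimal sketch under stated assumptions, not the authors' implementation: the goal encoding `(dx, dy, dyaw, dz)`, the numeric ranges in `GOAL_RANGES`, the choice of an in-place jump as the Stage 1 goal, and the names `EpisodeConfig` and `sample_episode` are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    goal: tuple               # hypothetical encoding: (dx, dy, dyaw, dz)
    randomize_dynamics: bool  # True only in the final stage


# Hypothetical goal ranges for illustration only (not taken from the paper).
GOAL_RANGES = {
    "dx": (-0.5, 1.5),    # forward landing displacement [m]
    "dy": (-1.0, 1.0),    # lateral landing displacement [m]
    "dyaw": (-1.5, 1.5),  # turning angle [rad]
    "dz": (0.0, 0.4),     # landing elevation [m]
}
IN_PLACE_GOAL = (0.0, 0.0, 0.0, 0.0)  # assumed single goal for Stage 1


def sample_episode(stage: int, rng: random.Random) -> EpisodeConfig:
    """Return the per-episode settings for training stage 1, 2, or 3."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    if stage == 1:
        # Stage 1: imitate the reference motion for one fixed goal.
        return EpisodeConfig(IN_PLACE_GOAL, False)
    # Stages 2 and 3: a new goal is drawn for every episode.
    goal = tuple(rng.uniform(lo, hi) for lo, hi in GOAL_RANGES.values())
    # Stage 3 additionally randomizes the simulated dynamics.
    return EpisodeConfig(goal, stage == 3)
```

Calling `sample_episode(3, random.Random(0))` returns a random goal with `randomize_dynamics=True`; the dynamics parameters themselves would be sampled by the simulator, which is outside the scope of this sketch.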

Abstract

This work aims to push the limits of agility for bipedal robots by enabling a torque-controlled bipedal robot to perform robust and versatile dynamic jumps in the real world. We present a reinforcement learning framework for training a robot to accomplish a large variety of jumping tasks, such as jumping to different locations and directions. To improve performance on these challenging tasks, we develop a new policy structure that encodes the robot's long-term input/output (I/O) history while also providing direct access to a short-term I/O history. In order to train a versatile jumping policy, we utilize a multi-stage training scheme that includes different training stages for different objectives. After multi-stage training, the policy can be directly transferred to a real bipedal Cassie robot. Training on different tasks and exploring more diverse scenarios lead to highly robust policies that can exploit the diverse set of learned maneuvers to recover from perturbations or poor landings during real-world deployment. Such robustness in the proposed policy enables Cassie to succeed in completing a variety of challenging jump tasks in the real world, such as standing long jumps, jumping onto elevated platforms, and multi-axes jumps.
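The policy structure described in the abstract, a long-term I/O history that is encoded plus a short-term I/O history that is passed through directly, can be illustrated with a small dependency-free sketch. It assumes the details given in the Figure 3 caption (a 1D CNN over a 2-second history, a 4-timestep short history, a 33 Hz policy, so roughly 66 long-history frames). Everything else is hypothetical: the class name `JumpPolicySketch`, the layer sizes, the kernel and stride, the random bias-free weights, and the single hidden layer.

```python
import math
import random


def conv1d_relu(seq, weights, stride):
    """Valid 1D convolution over time with ReLU. seq: T x D, weights: C x K x D."""
    k = len(weights[0])
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        frame = []
        for w in weights:
            s = sum(w[i][j] * window[i][j]
                    for i in range(k) for j in range(len(window[i])))
            frame.append(max(0.0, s))
        out.append(frame)
    return out  # T_out x C


def linear(x, weights, act=None):
    """Bias-free linear layer with an optional elementwise activation."""
    y = [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]
    return [act(v) for v in y] if act else y


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class JumpPolicySketch:
    """Hypothetical sizes; only the two-branch input structure follows the text."""

    def __init__(self, io_dim, goal_dim, ref_dim, n_motors, long_len=66,
                 short_len=4, channels=8, kernel=6, stride=3, hidden=32, seed=0):
        rng = random.Random(seed)
        self.stride = stride
        self.conv_w = [rand_matrix(kernel, io_dim, rng) for _ in range(channels)]
        t_out = (long_len - kernel) // stride + 1
        in_dim = t_out * channels + short_len * io_dim + goal_dim + ref_dim
        self.w1 = rand_matrix(hidden, in_dim, rng)
        self.w2 = rand_matrix(n_motors, hidden, rng)

    def act(self, long_hist, short_hist, goal, ref):
        # Long-term branch: the history is compressed by the 1D CNN.
        enc = conv1d_relu(long_hist, self.conv_w, self.stride)
        flat_enc = [v for frame in enc for v in frame]
        # Short-term branch: recent frames bypass the encoder entirely.
        flat_short = [v for frame in short_hist for v in frame]
        x = flat_enc + flat_short + list(goal) + list(ref)
        h = linear(x, self.w1, math.tanh)
        return linear(h, self.w2)  # desired motor positions
```

The point of the sketch is the concatenation in `act`: the short history reaches the output layers without passing through the encoder, so recent feedback is not blurred by the temporal compression applied to the long history.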

Paper Structure (37 sections, 1 equation, 12 figures, 3 tables)

Introduction
Objective of this Paper
Contributions
Related Work
Model-based optimal control for legged jumping
Model-free RL for legged locomotion control
Sim-to-real transfer for legged robots
Background and Preliminaries
Floating-base Model of Cassie
RL Background and Goal-Conditioned Policy
Multi-Stage Training for Versatile Jumps
Overview of the Multi-Stage Training Schematic
Reference Motion
Reward
Episode Design
...and 22 more sections

Figures (12)

Figure 2: Schematic of the framework for training the robot to perform versatile jumping skills in the real world, starting from a reference motion of a single jumping animation. The framework consists of three stages. In the first stage, we train the robot from scratch to imitate the animation while performing a single jump. Once the robot reliably achieves this single goal, we randomize the goal assigned to the robot in each training episode (landing at different locations, with different turning directions and elevations). After these two stages, we extensively randomize the dynamics properties of the simulated environment to improve the robustness of the robot during zero-shot transfer from simulation to the real world.
Figure 3: The architecture of the goal-conditioned jumping policy $\pi_\theta$. The policy outputs the desired motor positions $\mathbf{q}^d_m$, which are used by joint-level PD controllers to generate the motor torques $\bm{\tau}$ on the robot. The input to the policy includes the goal $\mathbf{c}$, which specifies the landing targets, the reference motion $\mathbf{q}_t^r$, which provides the robot a short preview of the reference trajectory, and a short 4-timestep history of the robot's input (robot's action $\mathbf{a}_{t-1}$) and output (robot's feedback $\mathbf{q}_t^o$). The policy is also provided with a long-term 2-second I/O history, which is first encoded by a 1D CNN. The policy updates at $33$ Hz while the rest runs at $2$ kHz.
Figure 4: Illustration of the baseline policy structures used to train the policy for bipedal jumping. (a) Ours: the proposed structure, discussed in detail in Fig. \ref{fig:controller}. (b) Residual policy, which has the same input structure as our method but outputs a residual term that is added to the reference motor position [lee2020learning, xie2020learning]. (c) Long History Only policy, which only has access to a long-term I/O history (we still provide the robot's immediate feedback to the base, as suggested by [peng2018sim]). (d) Short History Only policy, which is only provided with a short-term I/O history [li2021reinforcement]. We also compare with the RMA [kumar2021rma] / Teacher-Student (TS) [lee2020learning] training strategy, in which an (e) expert policy with access to privileged environment information (Table \ref{tab:randomization}) is first trained by RL and is later used to train the (f) RMA (student) policy by supervised learning. The RMA policy can be further finetuned by RL using (g) A-RMA [kumar2022adapting]. Although the short I/O history is not included in the original RMA [kumar2021rma] or TS [lee2020learning], it is included in this benchmark for a fair comparison. Blocks are shaded if their parameters are not updated. Dashed lines indicate that parameters are copied.
Figure 5: Learning curves for the different policy structures in Stage 3 (multi-goal training with dynamics randomization). Each curve is the average normalized return over $3$ random seeds, and the colored areas enclose the minimum and maximum values obtained across seeds. The normalized return is the return divided by the maximum episode length and lies in $[0,1]$. Our method performs similarly to the expert policy, which is used to supervise the RMA policies and has access to the privileged environment parameters. A-RMA shows the second-best performance but requires significantly more samples than the proposed method, followed by RMA. The policies with short history only or long history only learn similarly to each other but reach slightly lower returns than RMA. The residual policy performs worst because the reference motion added to the policy's action prevents the agent from exploring more diverse maneuvers.
Figure 6: Robustness comparison among three policies: (i) trained on a single task (jumping in place) with dynamics randomization, (ii) trained on a single task with dynamics randomization and random perturbations, and (iii) trained on multiple tasks with dynamics randomization but without random perturbations (proposed). The testing scenarios lie outside the training settings of all three policies. The single-goal policies fail to stabilize the robot, even the one trained with extensive perturbations. The goal-conditioned policy, which is trained on diverse jumping tasks but without perturbations, succeeds in stabilizing the robot by exploiting the learned skills. It is able to deviate from the command (jumping in place), using a lateral jump to stay robust to the lateral external force and two forward jumps to adapt to the forward CoM offset.
...and 7 more figures
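The Figure 3 caption states that the policy outputs desired motor positions, which joint-level PD controllers convert into motor torques, with the policy updating at 33 Hz and the rest of the loop running at 2 kHz. A minimal sketch of that two-rate arrangement follows. The PD law with a zero target velocity, the gains, and the unit-inertia toy joint in `hold_target` are assumptions made for illustration and are not taken from the paper.

```python
POLICY_HZ = 33   # policy update rate, from the Figure 3 caption
PD_HZ = 2000     # low-level loop rate, from the Figure 3 caption
STEPS_PER_ACTION = round(PD_HZ / POLICY_HZ)  # about 61 PD ticks per action


def pd_torques(q_des, q, q_dot, kp, kd):
    """Assumed PD law per joint: tau = kp * (q_des - q) - kd * q_dot."""
    return [kp_i * (qd - qi) - kd_i * qdi
            for qd, qi, qdi, kp_i, kd_i in zip(q_des, q, q_dot, kp, kd)]


def hold_target(q_des, q, q_dot, kp, kd, inertia=1.0):
    """Track one policy action for a full policy period on a toy joint model.

    The joint is a hypothetical unit-inertia mass with no gravity or contact,
    integrated with semi-implicit Euler at the PD rate.
    """
    dt = 1.0 / PD_HZ
    for _ in range(STEPS_PER_ACTION):
        tau = pd_torques(q_des, q, q_dot, kp, kd)
        q_dot = [v + t / inertia * dt for v, t in zip(q_dot, tau)]
        q = [p + v * dt for p, v in zip(q, q_dot)]
    return q, q_dot
```

For example, `hold_target([1.0], [0.0], [0.0], [100.0], [20.0])` moves the toy joint part of the way toward the target within one policy period; on the real robot the same held target would be tracked until the policy emits its next action.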

Theorems & Definitions (6)

Remark 1
Remark 2
Remark 3
Remark 4
Remark 5
Remark 6
