Reinforcement Learning for Versatile, Dynamic, and Robust Bipedal Locomotion Control

Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, Koushil Sreenath

TL;DR

The paper presents a general deep RL framework for versatile bipedal locomotion, introducing a dual-history I/O policy that leverages both long-term dynamics and short-term feedback to achieve robust walking, running, and jumping, including real-world deployment on Cassie. A three-stage training regime—single-task, task randomization, and dynamics randomization—enables zero-shot sim-to-real transfer and enhances disturbance robustness beyond traditional dynamics randomization. Extensive ablations show the superiority of end-to-end training of a dual-history policy over residual, distillation, or state-only baselines, with demonstrated long-horizon adaptivity to time-varying contacts, unknown terrain, and perturbations. Qualitative and quantitative real-world results include consistent in-place walking over years, a 400 m dash, 100 m dash, and a wide range of jumping maneuvers, illustrating practical impact for dynamic, high-DoF humanoid robots. The findings emphasize task randomization as a key robustness source and suggest that RL can jointly perform trajectory optimization, contact planning, and control without hand-crafted contact schedules, pointing toward unified policies for diverse legged locomotion tasks.
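The three-stage regime can be pictured as a schedule that switches randomization on progressively. The following is a minimal sketch under stated assumptions, not the authors' training code: the stage names follow the summary above, while `Stage`, `sample_episode`, the commanded-speed task parameter, the ground-friction dynamics parameter, and all numeric ranges are hypothetical placeholders chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    randomize_task: bool
    randomize_dynamics: bool


# Stage order as described in the summary; task randomization stays on
# once dynamics randomization is introduced in the last stage.
STAGES = (
    Stage("single_task", randomize_task=False, randomize_dynamics=False),
    Stage("task_randomization", randomize_task=True, randomize_dynamics=False),
    Stage("dynamics_randomization", randomize_task=True, randomize_dynamics=True),
)

# Hypothetical values for illustration only (not taken from the paper).
NOMINAL_SPEED = 1.0           # fixed goal used in the single-task stage
SPEED_RANGE = (-1.0, 3.0)     # commanded speed range when tasks are randomized
NOMINAL_FRICTION = 1.0        # simulator default when dynamics are not randomized
FRICTION_RANGE = (0.5, 1.5)   # friction range when dynamics are randomized


def sample_episode(stage, rng):
    """Draw the task command and dynamics parameters for one training episode."""
    if stage.randomize_task:
        speed = rng.uniform(*SPEED_RANGE)
    else:
        speed = NOMINAL_SPEED
    if stage.randomize_dynamics:
        friction = rng.uniform(*FRICTION_RANGE)
    else:
        friction = NOMINAL_FRICTION
    return {"command_speed": speed, "ground_friction": friction}


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in STAGES:
        print(stage.name, sample_episode(stage, rng))
```

In a real pipeline each stage would presumably continue training the policy from the previous stage with many randomized parameters; the sketch only shows which sources of randomization are active at each stage.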

Abstract

This paper presents a comprehensive study on using deep reinforcement learning (RL) to create dynamic locomotion controllers for bipedal robots. Going beyond focusing on a single locomotion skill, we develop a general control solution that can be used for a range of dynamic bipedal skills, from periodic walking and running to aperiodic jumping and standing. Our RL-based controller incorporates a novel dual-history architecture, utilizing both a long-term and short-term input/output (I/O) history of the robot. This control architecture, when trained through the proposed end-to-end RL approach, consistently outperforms other methods across a diverse range of skills in both simulation and the real world. The study also delves into the adaptivity and robustness introduced by the proposed RL system in developing locomotion controllers. We demonstrate that the proposed architecture can adapt to both time-invariant dynamics shifts and time-variant changes, such as contact events, by effectively using the robot's I/O history. Additionally, we identify task randomization as another key source of robustness, fostering better task generalization and compliance to disturbances. The resulting control policies can be successfully deployed on Cassie, a torque-controlled human-sized bipedal robot. This work pushes the limits of agility for bipedal robots through extensive real-world experiments. We demonstrate a diverse range of locomotion skills, including: robust standing, versatile walking, fast running with a demonstration of a 400-meter dash, and a diverse set of jumping skills, such as standing long jumps and high jumps.

Paper Structure (122 sections, 2 equations, 34 figures, 8 tables)

This paper contains 122 sections, 2 equations, 34 figures, 8 tables.

Figures (34)

  • Figure 2: Overview of this paper. First, Sec. \ref{sec:background} introduces the formulation of the locomotion control problem and the importance of utilizing the robot's I/O history. The details of our dual-history-based control architecture for various bipedal locomotion skills are presented in Sec. \ref{sec:controller}, followed by the training scheme discussed in Sec. \ref{sec:training}. Detailed studies then validate the advantages of the proposed policy structure in Sec. \ref{sec:policy_structure}, examine the sources of adaptivity in Sec. \ref{subsec:adaptivity}, and investigate the sources of the robust behaviors observed in the proposed RL-based controller in Sec. \ref{sec:multi_skill}. Extensive experiments using the proposed RL-based locomotion controllers to enable Cassie to perform robust standing, walking, running, and jumping skills are presented in Sec. \ref{sec:experiments}. Insights and discussions for readers interested in applying RL to train bipedal robots are provided in Sec. \ref{sec:discussion}.
  • Figure 3: The proposed RL-based controller architecture that leverages a dual-history of input ($\mathbf{a}$) and output ($\mathbf{o}$) (I/O) from the robot. The control policy $\pi_\theta$, operating at 33 Hz, processes a 2-second long I/O history. This data is initially encoded via a 1D CNN along its time axis before being merged with a base MLP. In addition, a short history spanning 4 timesteps is directly input into the base MLP, combined with skill-specific reference motion $\mathbf{q}^r_t$ and variable commands $\mathbf{c}_t$ that parameterize the tasks. The policy outputs desired motor positions $\mathbf{q}^d_m$ as the robot's actions, which are then smoothed using a low-pass filter (LPF). These filtered outputs are employed by joint-level PD controllers operating at 2 kHz to specify motor torques $\bm{\tau}$. This architecture is general for various locomotion skills like standing, walking, running, and jumping. This figure also annotates the generalized coordinates for Cassie, which include actuated joints ($q^{L/R}_{1,2,3,4,7}$, marked as red) and passive joints ($q^{L/R}_{5,6}$, marked as blue).
  • Figure 4: The multi-stage training framework to obtain a versatile control policy that can be zero-shot transferred to the real world. It starts with a single-task training stage, where the robot is encouraged to mimic a single reference motion with a fixed goal. This is followed by a task randomization stage, which expands the range of tasks the robot learns and fosters task generalization, resulting in a versatile policy. Once the robot is adept at various locomotion tasks and their transitions, extensive dynamics randomization is incorporated to enhance policy robustness for sim-to-real transfer. This framework is suitable for diverse bipedal locomotion skills, including walking, running, and jumping, and for learning from different sources of skill-specific reference motions such as trajectory optimization, human mocap, and animation.
  • Figure 5: Illustration of our proposed architecture and various baseline architectures for RL-based control policies for bipedal robot locomotion. Fig. \ref{fig:benchmark}a, Ours integrates both short and long-term I/O histories, with the base MLP and long history encoder jointly trained to specify motor positions. Fig. \ref{fig:benchmark}b, the Residual approach aligns with our architecture but adds a residual term to the reference motor position. Fig. \ref{fig:benchmark}c, the State Feedback Only baseline uses our model structure but relies solely on the robot's state history, excluding input history. Fig. \ref{fig:benchmark}d, the Long History Only approach depends on long I/O history without using short I/O history, while the Short History Only approach (Fig. \ref{fig:benchmark}e) focuses only on short-term I/O history, excluding the CNN encoder. The RMA/Teacher-Student method utilizes a two-phase policy distillation, with an expert (teacher) policy (Fig. \ref{fig:benchmark}f) guiding the training of an RMA (student) policy (Fig. \ref{fig:benchmark}g). This can be improved by A-RMA (Fig. \ref{fig:benchmark}h), which introduces an additional phase where the base MLP is finetuned while keeping the long I/O history encoder's parameters fixed. Notably, all expert, RMA, and A-RMA policies in this study incorporate short I/O histories into the base MLP, which is a new modification in this work to enable equitable comparison with ours. All of these architectures have the command and reference motion as input to the base MLP, as detailed in Fig. \ref{fig:controller} and omitted for brevity.
  • Figure 6: The learning performance using different policy structure designs illustrated in Fig. \ref{fig:benchmark}. It is assessed during Stage 3 training, which incorporates both task and dynamics randomization. These curves represent the average normalized episodic return across 3 distinct policy trainings from different random seeds, with shaded regions indicating the range between minimum and maximum returns. Note that there is no perturbation training for the jumping policy as it prevents the robot from learning the dynamic jumping skill. Our proposed method consistently outperformed other baselines across different skills. Notably, our policy's performance is comparable to that of the expert policy, which has access to privileged information but is not deployable in the real world. In contrast, the residual method shows the worst return. Using only a long history does not offer a clear benefit compared to a policy relying solely on a short history. In fact, the short history only policy even outperforms the long history only approach. Additionally, even with a dual-history approach like ours, omitting the robot's input history (only using state feedback) results in no improvement over short I/O history only. The student or RMA methods exhibit significant regression loss in bipedal locomotion control; particularly in dynamic skills like running, RMA fails to learn. This suggests the necessity of the A-RMA stage for further training, though it requires considerably more training samples and yields slightly lower returns compared to our method.
  • ...and 29 more figures
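
The rates and history lengths quoted in the Figure 3 caption imply some simple bookkeeping, which the following minimal sketch works through. It is an illustration under stated assumptions, not the authors' implementation: the 33 Hz policy rate, 2-second long history, 4-step short history, and 2 kHz PD rate come from the caption, while `DualHistoryBuffer`, `low_pass`, `pd_torque`, the first-order filter form, the zero desired velocity in the PD law, and the idea that the short history is simply the most recent slice of the long one are all assumptions made for this example.

```python
from collections import deque

POLICY_HZ = 33            # policy rate, from the Figure 3 caption
PD_HZ = 2000              # joint-level PD rate, from the Figure 3 caption
LONG_HISTORY_S = 2.0      # long I/O history duration, from the caption
SHORT_HISTORY_STEPS = 4   # short I/O history length, from the caption

# A 2-second history sampled at 33 Hz holds 66 policy steps.
LONG_HISTORY_STEPS = round(POLICY_HZ * LONG_HISTORY_S)
# Roughly 60 PD updates run between two consecutive policy updates.
PD_TICKS_PER_POLICY_STEP = PD_HZ // POLICY_HZ


class DualHistoryBuffer:
    """Stores recent (observation, action) pairs at the policy rate."""

    def __init__(self, long_steps=LONG_HISTORY_STEPS, short_steps=SHORT_HISTORY_STEPS):
        self.short_steps = short_steps
        self.pairs = deque(maxlen=long_steps)  # oldest pairs drop off automatically

    def push(self, obs, action):
        self.pairs.append((tuple(obs), tuple(action)))

    def long_history(self):
        """Input for the long-history encoder (a 1D CNN in the caption)."""
        return list(self.pairs)

    def short_history(self):
        """Input fed directly to the base MLP (assumed: the latest few pairs)."""
        return list(self.pairs)[-self.short_steps:]


def low_pass(prev, target, alpha):
    """First-order smoothing of desired motor positions; alpha is hypothetical."""
    return [alpha * t + (1.0 - alpha) * p for p, t in zip(prev, target)]


def pd_torque(q_des, q, dq, kp, kd):
    """Joint-level PD law toward q_des, assuming zero desired velocity."""
    return [kp * (qd - qi) - kd * dqi for qd, qi, dqi in zip(q_des, q, dq)]
```

With these numbers, the long-history encoder would see 66 I/O pairs per policy step, and each filtered policy output would be held as the PD target for about 60 torque updates.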

Theorems & Definitions (4)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4