Iterative Reinforcement Learning Based Design of Dynamic Locomotion Skills for Cassie

Zhaoming Xie, Patrick Clary, Jeremy Dao, Pedro Morais, Jonathan Hurst, Michiel van de Panne

TL;DR

The paper tackles the practical challenge of iteratively designing legged locomotion policies when the reward function changes between design iterations. It characterizes a trained policy by a set of Deterministic Action Stochastic State (DASS) tuples and uses them to combine reinforcement learning with imitation of the previous policy, so the reward can be fully redefined while deviation from the previous policy stays limited. The method is validated on the bipedal robot Cassie, demonstrating stable, variable-speed gaits with successful sim-to-real transfer without dynamics randomization, and shows that policies can be compressed and distilled into smaller networks. Curriculum and distillation techniques additionally enable robust multi-speed and multi-style locomotion, including slope walking. Overall, the approach provides a practical, data-efficient pathway for iterative locomotion policy design and deployment on real hardware.

Abstract

Deep reinforcement learning (DRL) is a promising approach for developing legged locomotion skills. However, the iterative design process that is inevitable in practice is poorly supported by the default methodology. It is difficult to predict the outcomes of changes made to the reward functions, policy architectures, and the set of tasks being trained on. In this paper, we propose a practical method that allows the reward function to be fully redefined on each successive design iteration while limiting the deviation from the previous iteration. We characterize policies via sets of Deterministic Action Stochastic State (DASS) tuples, which represent the deterministic policy state-action pairs as sampled from the states visited by the trained stochastic policy. New policies are trained using a policy gradient algorithm which then mixes RL-based policy gradients with gradient updates defined by the DASS tuples. The tuples also allow for robust policy distillation to new network architectures. We demonstrate the effectiveness of this iterative-design approach on the bipedal robot Cassie, achieving stable walking with different gait styles at various speeds. We demonstrate the successful transfer of policies learned in simulation to the physical robot without any dynamics randomization, and that variable-speed walking policies for the physical robot can be represented by a small dataset of 5-10k tuples.
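The DASS idea in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a linear policy `a = W s`, a toy two-dimensional linear system in place of the Cassie simulator, and hypothetical names (`collect_dass`, `mixed_gradient`, `w_imitation`, `noise_std`). It shows the two ingredients the abstract describes: states are gathered by running the stochastic (noisy) policy, while the stored action is the deterministic mean action, and the resulting tuples define a supervised gradient that is added to an RL policy gradient.

```python
import numpy as np


def collect_dass(mean_action, step, reset, n_samples, noise_std, rng, horizon=50):
    """Collect (state, deterministic action) pairs along stochastic rollouts."""
    states, actions = [], []
    while len(states) < n_samples:
        s = reset(rng)
        for _ in range(horizon):
            mu = mean_action(s)
            # Store the noise-free action for the state that was actually visited.
            states.append(s.copy())
            actions.append(mu.copy())
            if len(states) == n_samples:
                break
            # Execute the noisy action so the visited states come from the
            # stochastic policy rather than from the deterministic one.
            a = mu + noise_std * rng.standard_normal(mu.shape)
            s = step(s, a)
    return np.array(states), np.array(actions)


def mixed_gradient(W, rl_grad, dass_states, dass_actions, w_imitation):
    """Gradient of L_RL + w_imitation * mean ||W s - a||^2 for a linear policy.

    rl_grad is assumed to be computed elsewhere by a policy gradient method.
    """
    err = dass_states @ W.T - dass_actions              # shape (N, act_dim)
    sup_grad = 2.0 * err.T @ dass_states / len(dass_states)
    return rl_grad + w_imitation * sup_grad


# Toy setup (assumption): a point mass with position and velocity.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.0, 0.1])
W_expert = np.array([[-0.5, -0.2]])  # stands in for the previous iteration's policy

rng = np.random.default_rng(0)
states, actions = collect_dass(
    mean_action=lambda s: W_expert @ s,
    step=lambda s, a: A @ s + B * a[0],
    reset=lambda r: r.uniform(-1.0, 1.0, size=2),
    n_samples=200,
    noise_std=0.1,
    rng=rng,
)

# One update of a new policy; the RL gradient is set to zero here purely
# to isolate the imitation term.
W_new = np.zeros_like(W_expert)
grad = mixed_gradient(W_new, np.zeros_like(W_new), states, actions, w_imitation=1.0)
W_new = W_new - 1e-3 * grad
```

With a nonzero `rl_grad` from a new reward, the weight `w_imitation` trades off how strongly the new policy is pulled toward the previous one. Using only the supervised term with a different model class would correspond to the distillation use of the tuples mentioned in the abstract.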

Paper Structure

This paper contains 24 sections, 5 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Cassie walking on a treadmill with a neural network policy.
  • Figure 2: A walking policy produces a limit cycle, represented by the blue closed curve, and the green arrows indicate the required feedback to return to the limit cycle.
  • Figure 3: Left: The bipedal robot Cassie used for evaluation. The red arrows indicate the axes of actuated joints, the yellow arrows indicate passive joints with stiff leaf springs attached for compliance. Right: The neural network used to parameterize the policy.
  • Figure 4: Our policy design process. Four tracking-based policies are used as a starting point. DASS samples are passed from one policy to the next according to the arrows.
  • Figure 5: Network sizes impact the final result for reinforcement learning. We observe that larger network sizes typically learn faster and yield more stable policies. Compared to the $(256, 256)$ network, the learning proceeds much more slowly for network sizes of $(64, 64)$ and $(32, 32)$, and has a larger variance, indicating the final policy is not robust to noise.
  • ...and 4 more figures