Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning

Nikita Rudin, David Hoeller, Philipp Reist, Marco Hutter

TL;DR

This work tackles the long training times of deep reinforcement learning for legged locomotion by introducing a GPU-based, massively parallel training pipeline using Isaac Gym. It combines an on-policy PPO algorithm with a game-inspired automatic curriculum to train thousands of simulated robots simultaneously, achieving sub-4-minute training on flat terrain and ~20 minutes on uneven terrain for the ANYmal quadruped, with successful sim-to-real transfer. Key contributions include a thorough analysis of parallelism effects, hyper-parameter adaptations, and a robust curriculum that scales to diverse terrains, plus open-source code to accelerate further research. The results demonstrate rapid, repeatable policy generation that generalizes across robot variants and facilitates deployment on real hardware, signaling a shift toward faster real-world DRL development for legged locomotion.

Abstract

In this work, we present and study a training set-up that achieves fast policy generation for real-world robotic tasks by using massive parallelism on a single workstation GPU. We analyze and discuss the impact of different training algorithm components in the massively parallel regime on the final policy performance and training times. In addition, we present a novel game-inspired curriculum that is well suited for training with thousands of simulated robots in parallel. We evaluate the approach by training the quadrupedal robot ANYmal to walk on challenging terrain. The parallel approach allows training policies for flat terrain in under four minutes, and in twenty minutes for uneven terrain. This represents a speedup of multiple orders of magnitude compared to previous work. Finally, we transfer the policies to the real robot to validate the approach. We open-source our training code to help accelerate further research in the field of learned legged locomotion.
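The abstract names a game-inspired curriculum without stating its rules, so the following is only a sketch under assumed rules: a robot moves up one terrain level when it walks off its terrain tile, moves down one level when it covers less than half of the distance implied by its commanded velocity, and is sent to a random level after clearing the hardest one. The function name, argument names, and thresholds are hypothetical and are not taken from the released training code.

```python
import random


def update_terrain_levels(levels, distance_walked, commanded_distance,
                          tile_half_length, max_level, rng=None):
    """Return new terrain levels after one episode (illustrative rules only).

    levels             -- current difficulty level per robot (0 = easiest)
    distance_walked    -- distance each robot covered in the episode, in metres
    commanded_distance -- distance implied by each robot's velocity command
    tile_half_length   -- assumed distance from tile centre to tile border
    max_level          -- hardest available level
    """
    rng = rng or random.Random()
    new_levels = []
    for level, walked, commanded in zip(levels, distance_walked,
                                        commanded_distance):
        if walked > tile_half_length:
            # Assumed promotion rule: the robot left its terrain tile.
            level += 1
        elif walked < 0.5 * commanded:
            # Assumed demotion rule: less than half the commanded distance.
            level = max(level - 1, 0)
        if level > max_level:
            # Assumed rule: robots that clear the hardest level are
            # reassigned to a randomly chosen level.
            level = rng.randint(0, max_level)
        new_levels.append(level)
    return new_levels


if __name__ == "__main__":
    print(update_terrain_levels(
        levels=[0, 3, 2, 0],
        distance_walked=[5.0, 1.0, 3.0, 0.2],
        commanded_distance=[4.0, 4.0, 4.0, 4.0],
        tile_half_length=4.0,
        max_level=9,
    ))
```

Because each robot's level depends only on its own episode outcome, a rule of this kind needs no global schedule, which is one reason it suits training with thousands of robots at once.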

Paper Structure

This paper contains 23 sections, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Thousands of robots learning to walk in simulation.
  • Figure 2: Terrain types used for training and testing in simulation. (a) Randomly rough terrain with variations of 0.1 m. (b) Sloped terrain with an inclination of 25°. (c) Stairs with a width of 0.3 m and a height of 0.2 m. (d) Randomized, discrete obstacles with heights of up to ±0.2 m.
  • Figure 3: 4000 robots progressing through the terrains with automatic curriculum, after 500 (top) and 1000 (bottom) policy updates. The robots start the training session on the first row (closest to the camera) and progressively reach harder terrains.
  • Figure 4: (a) Average and standard deviation (over 5 runs) of the total episode reward after 1500 policy updates, for different numbers of robots and 3 different batch sizes. The ideal case of a batch size of 1M samples with 20000 robots is shown in red. (b) Total training time for the same experiments. (c) Reward dependency on total training time. Colors represent the number of robots, while shapes show the batch size (circles: 49152, crosses: 98304, triangles: 196608). Points in the upper left part of the graph (highlighted in green) represent the most desirable configurations.
  • Figure 5: Success rate of the tested policy on increasing terrain complexities. Robots start in the center of the terrain and are given a forward velocity command of 0.75 m/s and a side velocity command randomized within $[-0.1, 0.1]$ m/s. (a) Success rate for climbing stairs, descending stairs, and traversing discrete obstacles. (b) Success rate for climbing and descending sloped terrains.
  • ...and 5 more figures
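
The batch sizes listed in the Figure 4 caption can be related to the number of robots with simple arithmetic. The sketch below assumes that every policy update collects the same number of simulation steps from each robot, so that batch size = number of robots × steps per robot. The function names and the robot count of 4096 are illustrative choices, not values taken from the captions.

```python
# Batch sizes quoted in the Figure 4 caption (samples per policy update).
BATCH_SIZES = (49152, 98304, 196608)


def steps_per_robot(batch_size: int, num_robots: int) -> int:
    """Rollout steps each robot contributes to one policy update.

    Assumes batch_size = num_robots * steps, with equal steps per robot.
    """
    if num_robots <= 0:
        raise ValueError("num_robots must be positive")
    steps, remainder = divmod(batch_size, num_robots)
    if remainder:
        raise ValueError(
            f"batch size {batch_size} is not divisible by {num_robots} robots"
        )
    return steps


def rollout_table(num_robots: int) -> dict:
    """Map each quoted batch size to the implied steps per robot."""
    return {b: steps_per_robot(b, num_robots) for b in BATCH_SIZES}


if __name__ == "__main__":
    # 4096 robots is a hypothetical count chosen because it divides evenly.
    for batch, steps in rollout_table(4096).items():
        print(f"batch {batch:>6}: {steps} steps per robot")
```

Under this assumption, adding robots at a fixed batch size shortens each robot's rollout, which is the trade-off between parallelism and per-robot horizon that the three batch sizes in Figure 4 explore.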