Benchmarking Population-Based Reinforcement Learning across Robotic Tasks with GPU-Accelerated Simulation

Asad Ali Shahid; Yashraj Narang; Vincenzo Petrone; Enrico Ferrentino; Ankur Handa; Dieter Fox; Marco Pavone; Loris Roveda

TL;DR

This work addresses the data inefficiency of deep reinforcement learning in robotics by combining GPU-accelerated simulation with population-based training to enhance exploration and adapt hyperparameters online. It systematically benchmarks Population-Based Reinforcement Learning (PBRL) against PPO, SAC, and DDPG across four Isaac Gym tasks, and demonstrates a sim-to-real transfer by deploying a PBRL policy on a Franka Panda without additional adaptation. The results show that PBRL often yields higher final rewards and faster convergence, with performance gains varying by task and algorithm; the real-world deployment further validates the approach. The authors release an open-source codebase to enable broader exploration of PBRL in challenging robotic manipulation tasks, highlighting the practical impact for scalable, robust learning in robotics.

Abstract

In recent years, deep reinforcement learning (RL) has proven effective at solving complex continuous control tasks. However, this comes at the cost of an enormous amount of experience required for training, exacerbated by the sensitivity of learning efficiency and policy performance to hyperparameter selection, which often requires numerous time-consuming trials. This work leverages a Population-Based Reinforcement Learning (PBRL) approach and a GPU-accelerated physics simulator to enhance the exploration capabilities of RL by training multiple policies in parallel. The PBRL framework, which dynamically adjusts hyperparameters based on the performance of the learning agents, is benchmarked against three state-of-the-art RL algorithms -- PPO, SAC, and DDPG. The experiments are performed on four challenging tasks in Isaac Gym -- Anymal Terrain, Shadow Hand, Humanoid, and Franka Nut Pick -- and analyze the effect of population size and of the mutation mechanisms used for hyperparameters. The results show that PBRL agents achieve superior performance, in terms of cumulative reward, compared to non-evolutionary baseline agents. Finally, the trained agents are deployed in the real world on the Franka Nut Pick task. To our knowledge, this is the first sim-to-real attempt to deploy PBRL agents on real hardware. Code and videos of the learned policies are available on our project website (https://sites.google.com/view/pbrl).
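
To make the idea of adjusting hyperparameters from agent performance concrete, the following is a minimal sketch of one generic population-based training update (exploit, then explore). It is not the authors' implementation: the `Agent` class, the `pbt_step` function, the 25% truncation fraction, and the 0.8/1.2 perturbation factors are all assumptions chosen for illustration, and the `weights` dictionary merely stands in for policy parameters.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical population member: hyperparameters, score, and policy weights."""
    hparams: dict
    score: float = 0.0
    weights: dict = field(default_factory=dict)  # stand-in for network parameters


def pbt_step(population, rng, frac=0.25, factors=(0.8, 1.2)):
    """Run one exploit/explore update in place and return the replaced agents.

    frac and factors are illustrative assumptions, not values from the paper.
    """
    # Rank agents from best to worst by their latest evaluation score.
    ranked = sorted(population, key=lambda a: a.score, reverse=True)
    n = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:n], ranked[-n:]

    replaced = []
    for loser in bottom:
        # Skip agents that are also in the top group (very small populations).
        if any(loser is t for t in top):
            continue
        winner = rng.choice(top)
        # Exploit: copy the weights of a top-performing agent.
        loser.weights = dict(winner.weights)
        # Explore: perturb each copied hyperparameter by a random factor.
        loser.hparams = {
            name: value * rng.choice(factors)
            for name, value in winner.hparams.items()
        }
        replaced.append(loser)
    return replaced
```

In a training loop, a step like this would be called periodically after every agent has been evaluated; each replaced agent then continues training from the copied weights with its perturbed hyperparameters, and its score is refreshed at the next evaluation.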

Paper Structure (18 sections, 6 figures, 5 tables, 1 algorithm)

This paper contains 18 sections, 6 figures, 5 tables, 1 algorithm.

Introduction
Related Works
Massively-Parallel Simulation
Population-Based RL
Sim-to-Real Transfer
Contribution
Methods
Reinforcement Learning
Population-Based Training
Experiments
Results
PBRL-PPO
PBRL-SAC
PBRL-DDPG
Mutation Comparison
...and 3 more sections

Figures (6)

Figure 1: Simulated experiments are performed on four Isaac Gym benchmark tasks: (a) Anymal Terrain, to teach a quadruped robot to navigate uneven terrain; (b) Shadow Hand, to re-orient a cube to a desired configuration with a robot hand; (c) Humanoid, for bipedal locomotion; and (d) Franka Nut Pick, to grasp and lift a nut from a surface.
Figure 2: The PBRL framework learns robotic tasks through a combination of RL, evolutionary selection, and GPU-based parallel simulations.
Figure 3: Training results of baseline PPO (top), SAC (middle), and DDPG (bottom), along with their PBRL counterparts for $\lvert \mathcal{P} \rvert \in \{ 4, 8, 16 \}$. The shaded area shows the standard deviation around the mean performance across agents in $S$, or among 8 seeds in non-evolutionary baselines. SAC and DDPG are not evaluated on 16 agents due to higher memory usage.
Figure 4: Comparison of different mutation schemes for PBRL-PPO (top) and PBRL-DDPG (bottom) with $\lvert \mathcal{P} \rvert = 4$.
Figure 5: Success rate of Franka Nut Pick with PPO baseline and PBRL-PPO in simulation for $\lvert \mathcal{P} \rvert \in \{4, 8, 16\}$.
...and 1 more figures
