Going into Orbit: Massively Parallelizing Episodic Reinforcement Learning
Jan Oberst, Johann Bonneau
TL;DR
The paper addresses the challenge of efficiently training robotic reinforcement learning agents in realistic simulations by presenting NVIDIA's Orbit, a GPU-accelerated framework that integrates Isaac Sim with multiple RL libraries. It describes a box-pushing benchmark implemented in Orbit, with both a step-based pathway and a black-box reinforcement learning (BBRL) pathway that uses movement primitives, specifically Probabilistic Movement Primitives (ProMPs). Experiments that compare Orbit to Fancy Gym (MuJoCo) and tune Orbit for high parallelism (up to 4096 environments) demonstrate substantial gains in sample throughput and training speed, while also exposing reproducibility issues and simulator-dependent variance across platforms. The findings underscore Orbit's potential to accelerate robotics RL research and benchmarking, and they point to future work on broader benchmarks, randomization strategies, and cross-simulator comparisons to better characterize performance and generalization.
Abstract
Deep reinforcement learning has greatly expanded the possibilities of robot control across various domains. To overcome safety and sample efficiency issues, deep reinforcement learning models can be trained in a simulation environment, which allows for faster iteration cycles. Training can be accelerated further by parallelizing it on GPUs. NVIDIA's open-source robot learning framework Orbit leverages this potential by wrapping tensor-based reinforcement learning libraries for high parallelism and by building on Isaac Sim for its simulations. We contribute a detailed description of how to implement a benchmark reinforcement learning task, namely box pushing, in Orbit. Additionally, we benchmark our implementation against a CPU-based implementation and report the resulting performance metrics. Finally, we tune the hyperparameters of our implementation and show that Orbit generates significantly more samples in the same amount of time.
