Accelerating Reinforcement Learning through GPU Atari Emulation
Steven Dalton, Iuri Frosio, Michael Garland
TL;DR
CuLE is a CUDA port of ALE that emulates Atari games and renders their frames directly on the GPU, enabling thousands of parallel environments and removing the CPU-GPU communication bottleneck in deep reinforcement learning (DRL). It introduces a two-kernel GPU emulator, a caching strategy for initial states, and a batching approach with A2C+V-trace for fast, scalable training on single- and multi-GPU systems. The results show substantial throughput gains (up to 155M frames per hour, roughly 43K frames per second, on one GPU and up to 187K FPS with four GPUs) and faster wall-clock convergence, while highlighting trade-offs involving memory and per-environment update rates. The work provides an open-source tool and a framework for high-throughput, GPU-based RL research across varied hardware configurations.
Abstract
We introduce CuLE (CUDA Learning Environment), a CUDA port of the Arcade Learning Environment (ALE), which is used for the development of deep reinforcement learning algorithms. CuLE overcomes many limitations of existing CPU-based emulators and scales naturally to multiple GPUs. It leverages GPU parallelization to run thousands of games simultaneously, and it renders frames directly on the GPU to avoid the bottleneck arising from limited CPU-GPU communication bandwidth. CuLE generates up to 155M frames per hour on a single GPU, a throughput previously achieved only with a cluster of CPUs. Beyond highlighting the differences between CPU and GPU emulators in the context of reinforcement learning, we show how to leverage the high throughput of CuLE through effective batching of the training data, and we show accelerated convergence for A2C+V-trace. CuLE is available at https://github.com/NVLabs/cule.
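For readers unfamiliar with the V-trace correction mentioned above, the following is a minimal sketch of how V-trace value targets can be computed for a single n-step rollout, following the standard recursion from the IMPALA paper (Espeholt et al., 2018). It is not taken from the CuLE code base. The function name `vtrace_targets`, its parameters, and the use of plain Python lists are assumptions made for illustration, and episode termination inside the rollout is deliberately ignored.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Illustrative V-trace targets for one rollout (hypothetical helper).

    rewards[t]      : reward received after acting at step t
    values[t]       : critic estimate V(x_t) under the current policy
    bootstrap_value : critic estimate for the state after the last step
    ratios[t]       : importance ratio pi(a_t | x_t) / mu(a_t | x_t)
    Assumes no episode boundary occurs inside the rollout.
    """
    n = len(rewards)
    targets = [0.0] * n
    acc = 0.0                      # running value of v_{t+1} - V(x_{t+1})
    next_value = bootstrap_value   # V(x_{t+1}) for the step being processed
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])   # clipped weight on the TD error
        c = min(c_bar, ratios[t])       # clipped weight on the trace
        delta = rho * (rewards[t] + gamma * next_value - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
        next_value = values[t]
    return targets


if __name__ == "__main__":
    # On-policy data (all ratios equal to 1) reduces to n-step returns.
    print(vtrace_targets([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                         [1.0, 1.0, 1.0], gamma=1.0))
```

When the behaviour and target policies coincide, every ratio is 1 and the targets reduce to ordinary n-step bootstrapped returns. The clipping only matters when batched data was generated by an older version of the policy.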
