Accelerating Reinforcement Learning through GPU Atari Emulation

Steven Dalton, Iuri Frosio, Michael Garland

TL;DR

CuLE is a CUDA port of ALE that emulates Atari games and renders their frames on the GPU, enabling thousands of parallel environments and avoiding the CPU-GPU communication bottleneck in deep reinforcement learning (DRL). It introduces a two-kernel GPU emulator, a cache of initial states for episode resets, and a batching approach with A2C+V-trace for fast, scalable training on single- and multi-GPU systems. The results show substantial throughput gains (up to 155M frames/hour on one GPU and up to 187K FPS with four GPUs) and faster wall-clock convergence, while highlighting trade-offs with memory use and per-environment update rates. CuLE is released as an open-source tool for exploring high-throughput, GPU-based RL research across varied hardware configurations.

Abstract

We introduce CuLE (CUDA Learning Environment), a CUDA port of the Arcade Learning Environment (ALE), which is used for the development of deep reinforcement learning algorithms. CuLE overcomes many limitations of existing CPU-based emulators and scales naturally to multiple GPUs. It leverages GPU parallelization to run thousands of games simultaneously, and it renders frames directly on the GPU to avoid the bottleneck arising from the limited CPU-GPU communication bandwidth. CuLE generates up to 155M frames per hour on a single GPU, a throughput previously achieved only with a cluster of CPUs. Beyond highlighting the differences between CPU and GPU emulators in the context of reinforcement learning, we show how to leverage the high throughput of CuLE through effective batching of the training data, and we show accelerated convergence for A2C+V-trace. CuLE is available at https://github.com/NVLabs/cule.
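
The throughput figures quoted for CuLE use two different units (frames per hour for the single-GPU result, frames per second for the four-GPU result), which makes them hard to compare at a glance. The short sketch below converts between the two; the function and variable names are illustrative and do not come from the paper or from the CuLE code base.

```python
SECONDS_PER_HOUR = 3600.0


def frames_per_hour_to_fps(frames_per_hour: float) -> float:
    """Convert a throughput in frames per hour to frames per second."""
    return frames_per_hour / SECONDS_PER_HOUR


def fps_to_frames_per_hour(fps: float) -> float:
    """Convert a throughput in frames per second to frames per hour."""
    return fps * SECONDS_PER_HOUR


# Single-GPU figure: 155M frames/hour is roughly 43,056 FPS.
single_gpu_fps = frames_per_hour_to_fps(155e6)

# Four-GPU figure: 187K FPS is 673.2M frames/hour.
four_gpu_frames_per_hour = fps_to_frames_per_hour(187e3)
```

The two numbers are peak values and may have been measured under different conditions (games, number of environments, load), so their ratio should not be read as a multi-GPU scaling factor.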

Paper Structure

This paper contains 22 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Our CUDA-based Atari emulator uses an Atari CPU kernel to emulate the Atari CPU and advance the game state, and a second TIA kernel to emulate the TIA and render frames directly in GPU memory. For episode resets, we generate and store a cache of random initial states. Massive parallelization on GPU threads allows the parallel emulation of thousands of Atari games.
  • Figure 2: FPS and FPS per environment on System I (see the systems table), for OpenAI Gym, CuLECPU, and CuLE, as a function of the number of environments, under different load conditions: emulation only, and inference only. The boxplots indicate the minimum, 25th, 50th, and 75th percentiles, and maximum FPS over the entire set of 57 Atari games.
  • Figure 3: FPS as a function of the environment step, measured on System I (see the systems table) for emulation only on four Atari games, with 512 environments, for CuLE; each panel also shows the number of resetting environments. FPS is higher at the beginning, when all environments are in similar states and thread divergence within warps is minimized; after some steps, the correlation is lost, and FPS decreases and stabilizes. Minor oscillations in FPS are possibly associated with more or less computationally demanding phases in the emulation of the environments (e.g., when a goal is scored in Pong).
  • Figure 4: FPS generated by different emulation engines on System I (see the systems table) for Assault, as a function of the number of environments and under different load conditions (A2C with N-step bootstrapping, $N=5$).
  • Figure 5: Average testing score and standard deviation on four Atari games as a function of the training time, for A2C+V-trace on System III (see the systems table) with different batching strategies (see also the A2C+V-trace table). The number of training frames is doubled for the multi-GPU case (black line). Training is performed on CuLE or OpenAI Gym; testing is performed on OpenAI Gym environments (see the last paragraph of the Experiments section).
  • ...and 3 more figures
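
The reset mechanism described in the Figure 1 caption (a cache of random initial states that finished episodes reload from) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyBatchEmulator`, its state layout, and the fixed episode length are all hypothetical and are not CuLE's API. The loop over environments runs sequentially on the CPU, whereas the caption describes emulation parallelized across GPU threads, and the frame-rendering (TIA) kernel is omitted entirely.

```python
import random


class ToyBatchEmulator:
    """Hypothetical stand-in for a batch of emulated games (not CuLE's API)."""

    def __init__(self, num_envs, cache_size, episode_length=10, seed=0):
        self.rng = random.Random(seed)
        self.episode_length = episode_length
        # Build the cache of random initial states once, before stepping starts.
        self.reset_cache = [self._make_initial_state() for _ in range(cache_size)]
        # Every environment starts from a state drawn from the cache.
        self.states = [self._draw_cached_state() for _ in range(num_envs)]

    def _make_initial_state(self):
        # Toy state: a random "RAM" value plus a step counter.
        return {"ram": self.rng.randrange(256), "t": 0}

    def _draw_cached_state(self):
        # Copy so that environments never share (and mutate) a cached entry.
        return dict(self.rng.choice(self.reset_cache))

    def step(self, actions):
        """Advance every environment by one step; return how many were reset."""
        num_resets = 0
        for i, action in enumerate(actions):
            state = self.states[i]
            # Stand-in for the state-update kernel: advance the game state.
            state["ram"] = (state["ram"] + action) % 256
            state["t"] += 1
            # Stand-in for the reset path: reload a cached initial state.
            if state["t"] >= self.episode_length:
                self.states[i] = self._draw_cached_state()
                num_resets += 1
        return num_resets


emulator = ToyBatchEmulator(num_envs=8, cache_size=4)
resets_per_step = [emulator.step([1] * 8) for _ in range(10)]
```

In this toy all episodes have the same length, so all eight environments reset together on the tenth step. In a real emulator, episodes end at different times, and drawing restart states from a precomputed cache avoids regenerating an initial state for each reset.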