Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU

Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, Jan Kautz

TL;DR

GA3C addresses the computational bottlenecks of asynchronous deep RL by centralizing the neural network on a GPU and decoupling data generation from training through prediction and training queues. It introduces dynamic scheduling that adapts the numbers of predictors ($N_P$), trainers ($N_T$), and agents ($N_A$) to maximize GPU throughput while maintaining learning stability. Empirical results show significant training-throughput gains and faster convergence on Atari-2600 tasks compared with CPU A3C, with performance scaling with neural network size. The work offers a detailed analysis of latency, queue dynamics, and the trade-offs between trainings per second (TPS) and predictions per second (PPS), and provides open-source code to facilitate broader adoption and further research.

Abstract

We introduce a hybrid CPU/GPU version of the Asynchronous Advantage Actor-Critic (A3C) algorithm, currently the state-of-the-art method in reinforcement learning for various gaming tasks. We analyze its computational traits and concentrate on aspects critical to leveraging the GPU's computational power. We introduce a system of queues and a dynamic scheduling strategy, potentially helpful for other asynchronous algorithms as well. Our hybrid CPU/GPU version of A3C, based on TensorFlow, achieves a significant speedup compared to a CPU implementation; we make it publicly available to other researchers at https://github.com/NVlabs/GA3C.
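The system of queues mentioned in the abstract can be illustrated with a minimal sketch. It is not the GA3C implementation: `toy_model`, `agent`, `predictor`, `trainer`, and `run` are hypothetical names, the "model" is a plain Python function standing in for the single GPU network, the "environment" is a counter, and everything runs as threads for brevity. The only thing it tries to show is the data flow: agents send states to a prediction queue, predictors batch those requests and return actions, and trainers collect experiences from a training queue.

```python
import queue
import threading


def toy_model(states):
    """Hypothetical stand-in for the single shared model: one action per state."""
    return [s % 2 for s in states]


def predictor(prediction_q, reply_qs, stop, max_batch=4):
    """Batch pending prediction requests and send each action back to its agent."""
    while not stop.is_set():
        try:
            batch = [prediction_q.get(timeout=0.05)]
        except queue.Empty:
            continue
        while len(batch) < max_batch:  # opportunistically grow the batch
            try:
                batch.append(prediction_q.get_nowait())
            except queue.Empty:
                break
        actions = toy_model([state for _, state in batch])
        for (agent_id, _), action in zip(batch, actions):
            reply_qs[agent_id].put(action)


def agent(agent_id, steps, prediction_q, reply_q, training_q):
    """Play a toy episode: ask for an action, then submit the experience."""
    state = agent_id
    for _ in range(steps):
        prediction_q.put((agent_id, state))
        action = reply_q.get()  # block until a predictor answers
        reward = 1.0 if action == 0 else 0.0
        training_q.put((state, action, reward))
        state += 1


def trainer(training_q, stop, updates, batch_size=4):
    """Group experiences into batches; a real trainer would run a gradient step."""
    batch = []
    while not stop.is_set() or not training_q.empty():
        try:
            batch.append(training_q.get(timeout=0.05))
        except queue.Empty:
            continue
        if len(batch) == batch_size:
            updates.append(batch)
            batch = []
    if batch:  # flush the final partial batch
        updates.append(batch)


def run(n_agents=3, n_predictors=2, n_trainers=2, steps=4):
    """Wire agents, predictors, and trainers together; return the training batches."""
    prediction_q, training_q = queue.Queue(), queue.Queue()
    reply_qs = [queue.Queue() for _ in range(n_agents)]
    stop, updates = threading.Event(), []
    workers = [threading.Thread(target=predictor, args=(prediction_q, reply_qs, stop))
               for _ in range(n_predictors)]
    workers += [threading.Thread(target=trainer, args=(training_q, stop, updates))
                for _ in range(n_trainers)]
    agents = [threading.Thread(target=agent,
                               args=(i, steps, prediction_q, reply_qs[i], training_q))
              for i in range(n_agents)]
    for t in workers + agents:
        t.start()
    for t in agents:
        t.join()
    stop.set()  # all experiences are queued; let the workers drain and exit
    for t in workers:
        t.join()
    return updates


if __name__ == "__main__":
    batches = run()
    print(len(batches), "training batches,",
          sum(len(b) for b in batches), "experiences")
```

In this sketch the arguments `n_agents`, `n_predictors`, and `n_trainers` play the roles of $N_A$, $N_P$, and $N_T$. Because agents never touch the model directly, the three counts can be changed independently of each other.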

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison of A3C and GA3C architectures. Agents act concurrently both in A3C and GA3C. In A3C, however, each agent has a replica of the model, whereas in GA3C there is only one GPU instance of the model. In GA3C, agents utilize predictors to query the network for policies while trainers gather experiences for network updates.
  • Figure 2: Automatic dynamic adjustment of $N_T$, $N_P$, and $N_A$, to maximize TPS for Boxing (left) and Pong (right), starting from a sub-optimal configuration ($N_A\!=\!N_T\!=\!N_P\!=\!1$).
  • Figure 3: TPS of the top three configurations of predictors $N_P$ and trainers $N_T$ for several settings of agents $N_A$, while learning Pong on System I from Table \ref{tab:sys_prof_config}. TPS is normalized by best performance after $16$ minutes. Larger DNN models are also shown, as described in the text.
  • Figure 4: The average training queue size (left) and prediction batch size (right) of the top $3$ performing configurations of $N_P$ and $N_T$, for each $N_A$, with Pong and System I in Table \ref{tab:sys_prof_config}.
  • Figure 5: Effect of PPS on convergence speed. For each game, four different settings of GA3C are shown, all starting from the same DNN initialization. Numbers on the right show the cumulative number of frames played among all agents for each setting over the course of $3$ hours. Configurations playing more frames converge faster. The dynamic configuration method is capable of catching up with the optimal configuration despite starting with a sub-optimal setting, $N_T\!=\!N_P\!=\!N_A\!=\!1$.
  • ...and 2 more figures
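
The captions of Figures 2 and 5 describe an automatic adjustment of $N_T$, $N_P$, and $N_A$ that maximizes TPS from a sub-optimal start. The captions do not give the update rule, so the sketch below assumes a simple one: perturb one count by $\pm 1$, measure TPS, and keep the change only if TPS improved. The function names `adjust` and `toy_tps`, and the throughput curve itself, are hypothetical and exist only to make the example runnable.

```python
import random


def toy_tps(config):
    """Hypothetical throughput curve (not measured data), peaking at N_A=8, N_P=2, N_T=2."""
    return (100.0
            - (config["N_A"] - 8) ** 2
            - 4 * (config["N_P"] - 2) ** 2
            - 4 * (config["N_T"] - 2) ** 2)


def adjust(config, measure_tps, rounds=500, seed=0):
    """Assumed rule: change one count by +/-1 and keep it only if TPS improves."""
    rng = random.Random(seed)
    best = dict(config)
    best_tps = measure_tps(best)
    for _ in range(rounds):
        candidate = dict(best)
        key = rng.choice(sorted(candidate))  # pick N_A, N_P, or N_T
        candidate[key] = max(1, candidate[key] + rng.choice((-1, 1)))
        tps = measure_tps(candidate)
        if tps > best_tps:  # accept only strict improvements
            best, best_tps = candidate, tps
    return best, best_tps


if __name__ == "__main__":
    start = {"N_A": 1, "N_P": 1, "N_T": 1}
    print(adjust(start, toy_tps))
```

In a real system `measure_tps` would run the configuration for a while and report observed trainings per second, so each round would be slow and noisy. The toy curve here is deterministic, which is why a plain accept-if-better rule is enough for the illustration.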