A Brief Survey of Deep Reinforcement Learning

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath

TL;DR

Deep reinforcement learning (DRL) enables end-to-end learning for control and perception by combining reinforcement learning (RL) with deep neural networks. The paper surveys core DRL paradigms, contrasting value-based and policy-based methods, and detailing key algorithms such as the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor-critic (A3C). It discusses how deep representations address the curse of dimensionality, explores planning and model-based methods, and surveys current challenges including exploration, memory, transfer, and multi-agent settings. The work highlights benchmarks such as the Arcade Learning Environment (ALE) for Atari games and MuJoCo, and argues for integrating DRL with other AI techniques to achieve more data-efficient, generalizable, and capable autonomous agents.

Abstract

Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep $Q$-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.

Paper Structure

This paper contains 27 sections, 7 equations, 6 figures.

Figures (6)

  • Figure 1: A range of visual RL domains. (a) Two classic Atari 2600 video games, "Freeway" and "Seaquest", from the Arcade Learning Environment (ALE) [bellemare2015arcade]. Due to the range of supported games that vary in genre, visuals and difficulty, the ALE has become a standard testbed for DRL algorithms [mnih2015human, oh2015action, hausknecht2015deep, schulman2015trust, stadie2015incentivizing, wang2016dueling, mnih2016asynchronous]. As we will discuss later, the ALE is one of several benchmarks that are now being used to standardise evaluation in RL. (b) The TORCS car racing simulator, which has been used to test DRL algorithms that can output continuous actions [koutnik2013evolving, lillicrap2016continuous, mnih2016asynchronous] (as the games from the ALE only support discrete actions). (c) Utilising the potentially unlimited amount of training data that can be amassed in robotic simulators, several methods aim to transfer knowledge from the simulator to the real world [christiano2016transfer, rusu2017sim, tzeng2016towards]. (d) Two of the four robotic tasks designed by Levine et al. [levine2016end]: screwing on a bottle cap and placing a shaped block in the correct hole. Levine et al. [levine2016end] were able to train visuomotor policies in an end-to-end fashion, showing that visual servoing could be learned directly from raw camera inputs by using deep neural networks. (e) A real room, in which a wheeled robot trained to navigate the building is given a visual cue as input, and must find the corresponding location [zhu2017target]. (f) A natural image being captioned by a neural network that uses reinforcement learning to choose where to look [xu2015show]. By processing a small portion of the image for every word generated, the network can focus its attention on the most salient points. Figures reproduced from [bellemare2015arcade, lillicrap2016continuous, tzeng2016towards, levine2016end, zhu2017target, xu2015show], respectively.
  • Figure 2: The perception-action-learning loop. At time $t$, the agent receives state $\mathbf{s}_t$ from the environment. The agent uses its policy to choose an action $\mathbf{a}_t$. Once the action is executed, the environment transitions a step, providing the next state $\mathbf{s}_{t+1}$ as well as feedback in the form of a reward $r_{t+1}$. The agent uses knowledge of state transitions, of the form $(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, r_{t+1})$, in order to learn and improve its policy.
  • Figure 3: Two dimensions of RL algorithms, based on the backups used to learn or construct a policy. At the extremes of these dimensions are (a) dynamic programming, (b) exhaustive search, (c) one-step TD learning and (d) pure Monte Carlo approaches. Bootstrapping extends from (c) one-step TD learning to $n$-step TD learning methods [sutton1998reinforcement], with (d) pure Monte Carlo approaches not relying on bootstrapping at all. Another possible dimension of variation is the choice between (c, d) sampling actions and (a, b) taking the expectation over all choices. Recreated from [sutton1998reinforcement].
  • Figure 4: Actor-critic set-up. The actor (policy) receives a state from the environment and chooses an action to perform. At the same time, the critic (value function) receives the state and reward resulting from the previous interaction. The critic uses the TD error calculated from this information to update itself and the actor. Recreated from [sutton1998reinforcement].
  • Figure 5: The deep $Q$-network [mnih2015human]. The network takes the state (a stack of greyscale frames from the video game) and processes it with convolutional and fully connected layers, with ReLU nonlinearities between layers. At the final layer, the network outputs a discrete action, which corresponds to one of the possible control inputs for the game. Given the current state and chosen action, the game returns a new score. The DQN uses the reward (the difference between the new score and the previous one) to learn from its decision. More precisely, the reward is used to update its estimate of $Q$, and the error between its previous estimate and its new estimate is backpropagated through the network.
  • ...and 1 more figure
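
The perception-action-learning loop of Figure 2 and the TD-error update described in Figure 5 can be illustrated with a small sketch. This is not the DQN from the paper: it swaps the convolutional network for a plain lookup table (tabular Q-learning) so that the loop stays visible. The environment `ChainEnv`, the function `train`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor: start in state 0, reward 1 on reaching state 4."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position.
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning over transitions (s_t, a_t, s_{t+1}, r_{t+1})."""
    rng = random.Random(seed)
    env = ChainEnv()
    q = [[0.0, 0.0] for _ in range(env.n_states)]  # q[state][action]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy policy; ties between the two actions are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # The environment transitions one step and returns the next state and reward.
            s_next, r, done = env.step(a)
            # One-step TD target: bootstrap from the best next-state value unless terminal.
            target = r if done else r + gamma * max(q[s_next])
            td_error = target - q[s][a]
            q[s][a] += alpha * td_error
            s = s_next
    return q


if __name__ == "__main__":
    for state, (left, right) in enumerate(train()):
        print(f"state {state}: Q(left)={left:.3f}  Q(right)={right:.3f}")
```

In a DQN the table `q` would be replaced by a neural network and `td_error` would be backpropagated through it; the surrounding loop keeps the same shape.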