Deep Reinforcement Learning in Large Discrete Action Spaces

Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, Ben Coppin

TL;DR

The paper tackles reinforcement learning in environments with extremely large discrete action sets by introducing the Wolpertinger architecture, which embeds actions in a continuous space and uses a k-nearest-neighbor lookup so that only a small subset of actions is ever evaluated. An actor proposes a continuous proto-action, the $k$ discrete actions nearest to it are retrieved, and a critic selects the highest-valued one; with approximate nearest-neighbor search this makes action selection sub-linear in the number of actions, and the whole system is trained with the DDPG actor-critic algorithm. Empirical results show the method scales to up to one million actions across discretized control, planning, and recommender tasks, with substantial speedups from approximate nearest-neighbor search while maintaining strong performance. This approach provides a practical path to applying RL in real-world systems with enormous action spaces, such as large-scale recommender engines and industrial control processes.
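The selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `wolpertinger_select`, `toy_actor`, `toy_critic`, and `EMBEDDINGS` are hypothetical, the actor and critic are hand-written stand-ins for trained networks, and the neighbor search is an exact brute-force scan (linear in the number of actions) where the paper relies on an approximate nearest-neighbor index to get sub-linear lookup.

```python
import heapq
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def wolpertinger_select(
    state: Vector,
    actor: Callable[[Vector], Vector],
    critic: Callable[[Vector, Vector], float],
    action_embeddings: List[Vector],
    k: int,
) -> int:
    """Return the index of the chosen discrete action.

    1. The actor maps the state to a continuous proto-action.
    2. The k embedded actions nearest to the proto-action are retrieved.
       (Brute force here; an approximate index would replace this scan.)
    3. The critic scores those k candidates and the best one is returned.
    """
    proto = actor(state)
    neighbors = heapq.nsmallest(
        k,
        range(len(action_embeddings)),
        key=lambda i: euclidean(action_embeddings[i], proto),
    )
    return max(neighbors, key=lambda i: critic(state, action_embeddings[i]))


# Toy setup (assumption): 11 actions embedded on a 1-D grid from -1.0 to 1.0.
EMBEDDINGS: List[Vector] = [[i / 5 - 1.0] for i in range(11)]


def toy_actor(state: Vector) -> Vector:
    """Stand-in actor: echoes the first state feature as the proto-action."""
    return [state[0]]


def toy_critic(state: Vector, action: Vector) -> float:
    """Stand-in critic: value peaks for actions embedded near 0.85."""
    return -((action[0] - 0.85) ** 2)


STATE = [0.33]
choice_k1 = wolpertinger_select(STATE, toy_actor, toy_critic, EMBEDDINGS, k=1)
choice_k3 = wolpertinger_select(STATE, toy_actor, toy_critic, EMBEDDINGS, k=3)
choice_all = wolpertinger_select(STATE, toy_actor, toy_critic, EMBEDDINGS, k=11)
```

In this toy setting, `k=1` simply returns the action nearest the proto-action, while larger `k` lets the critic override the actor's proposal with a higher-valued neighbor. Setting `k` to the full action count reduces to evaluating every action, which is the cost the architecture is meant to avoid.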

Abstract

Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even often impossible to apply. An ability to generalize over the set of actions as well as sub-linear complexity relative to the size of the set are both necessary to handle such tasks. Current approaches are not able to provide both of these, which motivates the work in this paper. Our proposed approach leverages prior information about the actions to embed them in a continuous space upon which it can generalize. Additionally, approximate nearest-neighbor methods allow for logarithmic-time lookup complexity relative to the number of actions, which is necessary for time-wise tractable training. This combined approach allows reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods. We demonstrate our algorithm's abilities on a series of tasks having up to one million actions.

Paper Structure

This paper contains 22 sections, 1 theorem, 12 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Denote the closest $k$ actions as integers $\{1, \ldots, k\}$. Then in the scenario as described above, the expected value of the maximum of the $k$ closest actions is

Figures (11)

  • Figure 1: Wolpertinger Architecture
  • Figure 2: Agent performance for various settings of $k$ with exact lookup as a function of steps. With 0.5% of neighbors, training time is prohibitively slow and convergence is not achieved.
  • Figure 3: Agent performance for various settings of $k$ and FLANN as a function of wall-time on one million action cart-pole. We can see that with 0.5% of neighbors, training time is prohibitively slow.
  • Figure 4: Agent performance for various plan lengths; a plan of $n=20$ corresponds to $2^{20}=1{,}048{,}576$ actions. The agent is able to learn faster with longer plan lengths. $k=1$ and 'slow' FLANN settings are used.
  • Figure 5: Agent performance for various percentages of $k$ in a 20-step plan task in Puddle World with FLANN settings on 'slow'.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Lemma 1
  • proof