Human-level performance in first-person multiplayer games with population-based deep reinforcement learning

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel

TL;DR

The paper tackles multi-agent reinforcement learning in a dense, pixel-based, 3D environment by training a population of agents with internal reward signals and a two-tier temporal hierarchy. The FTW framework combines population-based training, internal reward shaping, and memory-enabled hierarchical RL to achieve human-level performance in Capture the Flag on procedurally generated Quake III Arena maps, surpassing strong humans and prior agents. It demonstrates robust generalization to unseen teammates, opponents, maps, and team sizes, and provides deep analyses of learned representations and emergent high-level behaviors. The work advances scalable, end-to-end learning in complex multi-agent settings and highlights the potential for memory and temporal abstraction to drive sophisticated strategic play.

Abstract

Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments and two-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level performance in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag, using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents is trained concurrently from thousands of parallel matches with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During game-play, these agents display human-like behaviours such as navigating, following, and defending based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation the trained agents exceeded the win-rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.
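
The two-tier optimisation described in the abstract can be illustrated with a small sketch: an inner tier in which each agent turns game-point events into its own internal reward, and an outer tier in which the population copies and perturbs the reward weights and hyperparameters of stronger members. This is a minimal illustration, not the authors' implementation. The event names, the `Agent` fields, the `margin` threshold, and the 20% perturbation scale are all assumptions made for the example, and the copying of network weights that would accompany a real exploit step is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical event names: the source says only that game points cover
# thirteen events; three are listed here for illustration.
EVENTS = ["pickup_opponent_flag", "tag_opponent", "opponent_captures_flag"]


@dataclass
class Agent:
    reward_weights: dict   # w: event name -> internal reward weight
    learning_rate: float   # one example hyperparameter evolved by the outer tier
    win_rate: float = 0.5  # assumed to be estimated elsewhere from match outcomes


def internal_reward(agent, events):
    """Inner tier: r_t = w(rho_t), the summed weights of this step's events."""
    return sum(agent.reward_weights[e] for e in events)


def perturb(value, rng, scale=0.2):
    """Scale a value by (1 - scale) or (1 + scale), chosen at random."""
    return value * rng.choice([1.0 - scale, 1.0 + scale])


def pbt_step(agent, rival, rng, margin=0.1):
    """Outer tier: exploit a clearly stronger rival, then explore by perturbing.

    Returns the agent unchanged unless the rival's win rate exceeds the
    agent's by more than `margin` (an assumed threshold).
    """
    if rival.win_rate - agent.win_rate <= margin:
        return agent
    return Agent(
        reward_weights={e: perturb(w, rng) for e, w in rival.reward_weights.items()},
        learning_rate=perturb(rival.learning_rate, rng),
        win_rate=rival.win_rate,
    )
```

In use, `internal_reward` would be called at every environment step to produce the signal that RL optimises, while `pbt_step` would run far less often, which matches the description of population-based training as operating on a slower time scale than RL.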

Paper Structure

This paper contains 25 sections, 11 equations, 15 figures.

Figures (15)

  • Figure 1: CTF task and computational training framework. Shown are two example maps that have been sampled from the distribution of outdoor maps (a) and indoor maps (b). Each agent in the game only sees its own first-person pixel view of the environment (c). Training data is generated by playing thousands of CTF games in parallel on a diverse distribution of procedurally generated maps (d), and used to train the agents that played in each game with reinforcement learning (e). We train a population of 30 different agents together, which provides a diverse set of teammates and opponents to play with, and is also used to evolve the internal rewards and hyperparameters of agents and learning process (f). Game-play footage and further exposition of the environment variability can be found in Supplementary Video https://youtu.be/dltN4MxV1RI.
  • Figure 2: Agent architecture and benchmarking. (a) Shown is how the agent processes a temporal sequence of observations $\mathbf{x}_t$ from the environment. The model operates at two different time scales, faster at the bottom, and slower by a factor of $\tau$ at the top. A stochastic vector-valued latent variable is sampled at the fast time scale from distribution $\mathbb{Q}_t$ based on observations $\mathbf{x}_t$. The action distribution $\pi_t$ is sampled conditional on the latent variable at each time step $t$. The latent variable is regularised by the slow moving prior $\mathbb{P}_t$ which helps capture long-range temporal correlations and promotes memory. The network parameters are updated using reinforcement learning based on the agent's own internal reward signal $r_t$, which is obtained from a learnt transformation $\mathbf{w}$ of game points $\rho_{t}$. $\mathbf{w}$ is optimised for winning probability through population based training, another level of training performed at yet a slower time scale than RL. Detailed network architectures are described in Figure \ref{fig:arch}. (b) Top: Shown are the Elo skill ratings of the FTW agent population throughout training (blue) together with those of the best baseline agents using hand tuned reward shaping (RS) (red) and game winning reward signal only (black), compared to human and random agent reference points (violet, shaded region shows strength between 10th and 90th percentile). It can be seen that the FTW agent achieves a skill level considerably beyond strong human subjects, whereas the baseline agent's skill plateaus below, and does not learn anything without reward shaping (see Supplementary Materials for evaluation procedure). (b) Bottom: Shown is the evolution of three hyperparameters of the FTW agent population: learning rate, KL weighting, and internal time scale $\tau$, plotted as mean and standard deviation across the population.
  • Figure 3: Knowledge representation and behavioural analysis. (a) The 2D t-SNE embedding of an FTW agent's internal states during game-play. Each point represents the internal state $(\mathbf{h}^p, \mathbf{h}^q)$ at a particular point in the game, and is coloured according to the high-level game state at this time -- the conjunction of four basic CTF situations (b). Colour clusters form, showing that nearby regions in the internal representation of the agent correspond to the same high-level game state. (c) A visualisation of the expected internal state arranged in a similarity-preserving topological embedding (Figure \ref{fig:ext_neural_response}). (d) We show distributions of situation conditional activations for particular single neurons which are distinctly selective for these CTF situations, and show the predictive accuracy of this neuron. (e) The true return of the agent's internal reward signal and (f) the agent's prediction, its value function. (g) Regions where the agent's internal two-timescale representation diverges, the agent's surprise. (h) The four-step temporal sequence of the high-level strategy opponent base camping. (i) Three automatically discovered high-level behaviours of agents and corresponding regions in the t-SNE embedding. To the right, average occurrence per game of each behaviour for the FTW agent, the FTW agent without temporal hierarchy (TH), self-play with reward shaping agent, and human subjects (more detail in Figure \ref{fig:ext_behvaiours}).
  • Figure 4: Progression of agent during training. Shown is the development of knowledge representation and behaviours of the FTW agent over the training period of 450K games, segmented into three phases (Supplementary Video https://youtu.be/dltN4MxV1RI). Knowledge: Shown is the percentage of game knowledge that is linearly decodable from the agent's representation, measured by average scaled AUCROC across 200 features of game state. Some knowledge is compressed to single neuron responses (Figure \ref{fig:three} (a)), whose emergence in training is shown at the top. Relative Internal Reward Magnitude: Shown is the relative magnitude of the agent's internal reward weights of three of the thirteen events corresponding to game points $\rho$. Early in training, the agent puts large reward weight on picking up the opponent flag, whereas later this weight is reduced, and reward for tagging an opponent and penalty when opponents capture a flag are increased by a factor of two. Behaviour Probability: Shown are the frequencies of occurrence for three of the 32 automatically discovered behaviour clusters through training. Opponent base camping (red) is discovered early on, whereas teammate following (blue) becomes very prominent midway through training before mostly disappearing. The home base defence behaviour (green) resurges in occurrence towards the end of training, in line with the agent's increased internal penalty for more opponent flag captures. Memory Usage: Shown are heat maps of visitation frequencies for locations in a particular map (left), and locations of the agent at which the top-ten most frequently read memories were written to memory, normalised by random reads from memory, indicating which locations the agent learned to recall. Recalled locations change considerably throughout training, eventually showing the agent recalling the entrances to both bases, presumably in order to perform more efficient navigation in unseen maps, shown more generally in Figure \ref{fig:ext_dnc}.
  • Figure S1: Shown are schematics of samples of procedurally generated maps on which agents were trained. In order to demonstrate the robustness of our approach we trained agents on two distinct styles of maps, procedural outdoor maps (top) and procedural indoor maps (bottom).
  • ...and 10 more figures
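
Figure 2(b) reports skill as Elo ratings. As a reading aid, the sketch below shows how a rating gap translates into an expected win probability under the standard Elo model, together with the textbook incremental update. It is an illustration only: the paper's own rating procedure is described in its Supplementary Materials and may differ, and the 400-point scale and `k=32` step size are the conventional defaults, assumed here rather than taken from the source.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, and 0.0 for a loss.
    The step size k is an assumed conventional value.
    """
    delta = k * (score_a - elo_expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Under this model two equally rated players each have an expected score of 0.5, and a 400-point advantage corresponds to an expected score of 10/11, or roughly 0.91.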