A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity

Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, Enrique Munoz de Cote

TL;DR

The paper addresses learning in multiagent environments under non-stationarity, where opponents adapt their strategies and the target changes over time. It proposes a unifying three-component framework (policy-generating functions, beliefs, and influence) and a five-category taxonomy (ignore, forget, respond to target opponents, learn opponent models, theory of mind) to organize algorithms across observability and opponent adaptation. It synthesizes formal models from multi-armed bandits, reinforcement learning, and game theory, and illustrates the taxonomy with an Iterated Prisoner’s Dilemma example, surveying state-of-the-art algorithms and their theoretical guarantees. The work also discusses domains, strengths and limitations, and open questions, guiding future research toward diverse opponents, dynamic interactions, learning objectives, and real-world applications.

Abstract

The key challenge in multiagent learning is learning a best response to the behaviour of other agents, which may be non-stationary: if the other agents adapt their strategy as well, the learning target moves. Disparate streams of research have approached non-stationarity from several angles, each making a variety of implicit assumptions, which makes it hard to keep an overview of the state of the art and to validate the innovation and significance of new works. This survey presents a coherent overview of work that addresses opponent-induced non-stationarity with tools from game theory, reinforcement learning and multi-armed bandits. Further, we reflect on the principal approaches by which algorithms model and cope with this non-stationarity, arriving at a new framework and five categories (in increasing order of sophistication): ignore, forget, respond to target models, learn models, and theory of mind. A wide range of state-of-the-art algorithms is classified into a taxonomy, using these categories and key characteristics of the environment (e.g., observability) and adaptation behaviour of the opponents (e.g., smooth, abrupt). To clarify further, we present illustrative variations of one domain, contrasting the strengths and limitations of each category. Finally, we discuss in which environments the different approaches yield the most merit, and point to promising avenues of future research.

Paper Structure

This paper contains 58 sections, 11 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: The agent interacts with the environment, performing an action $a$ that affects the state of the environment according to a function $T$, producing the state $s$. The agent perceives an observation $z$ about the environment (given by a function $Z$) and obtains a reward $r$ (given by a function $R$).
  • Figure 2: A Markov decision process (MDP) with four states $S0, S1, S2, S3$ and two actions $a1, a2$. The arrows denote the tuple: action, transition probability and reward.
  • Figure 3: (a) The automaton that describes the TFT strategy: depending on the opponent's action (c or d), it transitions between the two states C and D. (b) The automaton describing the Pavlov strategy: it consists of four states formed by the last action of both agents (CC, CD, DC, DD).
  • Figure 4: A learning agent $\mathcal{A}$ (outside the cloud) and how it models one opponent $\mathcal{O}$ (inside the cloud) exemplifying the 5 categories of how to handle non-stationary behaviour.
  • Figure 5: Diagram of the algorithms (coloured boxes; each colour represents one experimental domain) analysed in this survey, divided into 5 categories (dashed lines) according to how they handle non-stationarity. We present how they are connected to each other (arrows) and highlight those algorithms that are representative of each category (double box).
  • ...and 2 more figures
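
The two automata in Figure 3 can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function names (`tft_next`, `pavlov_next`, `play`) are hypothetical, and it assumes the usual conventions that both strategies open with cooperation and that Pavlov (win-stay, lose-shift) cooperates after CC or DD and defects after CD or DC.

```python
from typing import List, Tuple

C, D = "C", "D"  # cooperate / defect


def tft_next(opponent_last: str) -> str:
    """Tit-for-Tat: the next state (and action) mirrors the opponent's last action."""
    return opponent_last


def pavlov_next(own_last: str, opponent_last: str) -> str:
    """Pavlov: cooperate if both agents chose the same action last round, else defect."""
    return C if own_last == opponent_last else D


def play(opponent_actions: List[str]) -> Tuple[List[str], List[str]]:
    """Run both automata against a fixed sequence of opponent actions.

    Assumption (not stated in the caption): both automata start in the
    cooperative state.
    """
    tft_moves: List[str] = []
    pavlov_moves: List[str] = []
    tft_state, pavlov_state = C, C
    for opp in opponent_actions:
        # Act according to the current state, then transition on the observed action.
        tft_moves.append(tft_state)
        pavlov_moves.append(pavlov_state)
        tft_state = tft_next(opp)
        pavlov_state = pavlov_next(pavlov_state, opp)
    return tft_moves, pavlov_moves


if __name__ == "__main__":
    tft, pavlov = play([C, D, D, C])
    print("TFT:   ", tft)
    print("Pavlov:", pavlov)
```

Against the opponent sequence C, D, D, C the two strategies diverge in the fourth round: TFT keeps defecting in response to the opponent's previous defection, whereas Pavlov returns to cooperation after mutual defection.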

Theorems & Definitions (14)

  • Definition 1: Markov decision process
  • Definition 2: Normal-form game
  • Definition 3: Mixed strategy
  • Definition 4: Best response
  • Definition 5: Minimax strategy
  • Definition 6: Security level
  • Definition 7: Nash equilibrium (Nash, 1950)
  • Definition 8: Stochastic game
  • Definition 9: Repeated game
  • Definition 10: Policy generating functions
  • ...and 4 more