Understanding Individual Decision-Making in Multi-Agent Reinforcement Learning: A Dynamical Systems Approach

James Rudd-Jones, María Pérez-Ortiz, Mirco Musolesi

TL;DR

This work reframes multi-agent reinforcement learning (MARL) as a set of coupled stochastic dynamical systems centred on individual agents, enabling stability and sensitivity analysis beyond traditional mean-field approaches. Using tools from dynamical systems theory, such as invariant distributions, Lyapunov exponents, recurrence plots, and fractal dimensions, the authors diagnose how learning updates interact with the environment to produce fixed points, cycles, or chaotic behaviour. Experiments on simple stateless games and on the Overcooked environment illustrate how exploration, discounting, and function approximation shape the dynamical regime, offering a principled route to stability-aware MARL design. The framework bridges theory and practice, providing a scalable, agent-centric toolkit for understanding and controlling long-run MARL dynamics.

Abstract

Analysing learning behaviour in Multi-Agent Reinforcement Learning (MARL) environments is challenging, in particular with respect to individual decision-making. Practitioners tend to study or compare MARL algorithms from a qualitative perspective, largely because of the inherent stochasticity of practical algorithms, which arises from random dithering exploration strategies, environment transition noise, and stochastic gradient updates, to name a few sources. Traditional analytical approaches, such as replicator dynamics, often rely on mean-field approximations to remove stochastic effects. This simplification can capture general overall trends, but it might lead to dissonance between analytical predictions and actual realisations of individual trajectories. In this paper, we propose a novel perspective on MARL systems by modelling them as coupled stochastic dynamical systems, capturing both agent interactions and environmental characteristics. Leveraging tools from dynamical systems theory, we analyse the stability and sensitivity of agent behaviour at the individual level, which are key dimensions for practical deployment, for example in the presence of strict safety requirements. This framework allows us, for the first time, to rigorously study MARL dynamics while taking their inherent stochasticity into consideration, providing a deeper understanding of system behaviour and practical insights for the design and control of multi-agent learning processes.

Paper Structure

This paper contains 20 sections, 21 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison between replicator dynamics and realisations of Policy Gradient, Tabular Q-learning, and IDQN with Boltzmann exploration in the four stateless environments. Replicator dynamics are represented by the analytical vector field and by a realisation, shown in blue, from an arbitrary initial condition. Realisations of the other algorithms are identified in the legend. Realisations start at circular points and end at stars.
  • Figure 2: Stationary distributions calculated from realisations of training for two IDQN agents in Prisoner's Dilemma and Matching Pennies. The two figures on the left show agents using Boltzmann exploration, so the parameters can be interpreted as probabilities of taking action $0$. In the Boltzmann plots, bin counts are intentionally kept very low so that it is clear where the stationary distribution has density. The two figures on the right show agents using $\epsilon$-greedy exploration; here the parameters cannot be interpreted as action probabilities, which is why their scale can be much larger.
  • Figure 3: Recurrence plots from a realisation of training two IDQN agents in Prisoner's Dilemma and Matching Pennies. Each plot marks pairs of times, one on the $x$-axis and one on the $y$-axis, at which the coupled dynamical system of all agents visits the same area of phase space. Intuitively, a point is marked when $\theta_i \approx \theta_j$, with $i=x$ and $j=y$. The identity band is masked out because the system is always recurrent when $i=j$.
  • Figure 4: Effect of varying $\gamma$, the discounting parameter in IDQN, and $\epsilon_\text{End}$, the end value for $\epsilon$-greedy exploration in IDQN, on the attractor of the coupled dynamical system, measured via the maximum Lyapunov exponent and the fractal dimension $D_2$.
  • Figure 5: Recurrence plot calculated from a realisation of training two IDQN agents in Overcooked.
  • ...and 1 more figure
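
The recurrence plots described in the Figure 3 caption can be illustrated with a minimal Python sketch. It marks a pair of times $(i, j)$ whenever $\theta_i \approx \theta_j$ and masks the identity band. The function name `recurrence_matrix`, the Euclidean distance, the `radius` threshold, and the `mask_width` parameter are all assumptions made for illustration; the listing above does not state the paper's actual threshold or distance measure. The synthetic noisy cycle is only a stand-in for logged agent parameters.

```python
import numpy as np


def recurrence_matrix(trajectory, radius, mask_width=1):
    """Binary recurrence matrix for a trajectory of shape (T, d).

    Entry (i, j) is True when the states at times i and j lie within
    `radius` of each other (Euclidean distance, an assumed choice).
    Entries with |i - j| < mask_width are set to False, which masks
    the trivially recurrent identity band.
    """
    traj = np.asarray(trajectory, dtype=float)
    if traj.ndim == 1:
        traj = traj[:, None]  # treat a 1-D series as T states of dimension 1

    # Pairwise distances between the states at every pair of times.
    diffs = traj[:, None, :] - traj[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    rec = dists <= radius

    # Mask the identity band around the main diagonal.
    idx = np.arange(traj.shape[0])
    band = np.abs(idx[:, None] - idx[None, :]) < mask_width
    rec[band] = False
    return rec


# Hypothetical stand-in for a logged parameter trajectory: a noisy cycle.
rng = np.random.default_rng(0)
t = np.arange(200)
theta = np.stack([np.sin(0.2 * t), np.cos(0.2 * t)], axis=1)
theta += 0.01 * rng.normal(size=theta.shape)

R = recurrence_matrix(theta, radius=0.1)
```

For a cyclic trajectory like this one, `R` would be expected to show diagonal stripes parallel to the masked identity band, whereas a trajectory that settles at a fixed point would fill in a solid block. Plotting `R` with any image viewer (for example `matplotlib.pyplot.imshow`) gives the recurrence plot.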