Mutation-Bias Learning in Games

Johann Bauer; Sheldon West; Eduardo Alonso; Mark Broom

TL;DR

This work addresses convergence in multi-agent reinforcement learning by formulating two mutation-bias learning variants, MBL-DPU and MBL-LC, that connect to mutation-perturbed replicator dynamics. The authors establish a direct link between the stochastic updates and the ODE system, proving convergence properties and highlighting how the mutation perturbation drives interior equilibria toward Nash equilibria in several settings. Compared to FAQ and WoLF-PHC, MBL-DPU offers stronger analytic guarantees and robustness to increasing dimensionality, while MBL-LC trades some reliability for faster convergence in simpler games. The results demonstrate the value of a dynamical-systems perspective for MARL, enabling transferability of insights and guiding parameter choices for convergence and generalization in practice.

Abstract

We present two variants of a multi-agent reinforcement learning algorithm based on evolutionary game theoretic considerations. The intentional simplicity of one variant enables us to prove results on its relationship to a system of ordinary differential equations of replicator-mutator dynamics type, allowing us to present proofs on the algorithm's convergence conditions in various settings via its ODE counterpart. The more complicated variant enables comparisons to Q-learning based algorithms. We compare both variants experimentally to WoLF-PHC and frequency-adjusted Q-learning on a range of settings, illustrating cases of increasing dimensionality where our variants preserve convergence in contrast to more complicated algorithms. The availability of analytic results provides a degree of transferability of results as compared to purely empirical case studies, illustrating the general utility of a dynamical systems perspective on multi-agent reinforcement learning when addressing questions of convergence and reliable generalisation.
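The ODE counterpart mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not reproduce the equations, so the form used here is an assumption: a two-player replicator term plus a linear mutation term $M(c_h - x_h)$ pulling towards a uniform distribution $c$. The value $M = 0.05$, the explicit Euler integrator, the step size and the helper names `rmd_step` and `integrate` are likewise illustrative choices, not taken from the paper.

```python
import numpy as np

def rmd_step(x, y, A, B, M, c_x, c_y, dt):
    """One explicit Euler step of the ASSUMED replicator-mutator ODE."""
    f_x = A @ y        # payoff of each pure action of player 1 against y
    f_y = B.T @ x      # payoff of each pure action of player 2 against x
    dx = x * (f_x - x @ f_x) + M * (c_x - x)   # replicator + mutation
    dy = y * (f_y - y @ f_y) + M * (c_y - y)
    return x + dt * dx, y + dt * dy

def integrate(x0, y0, A, B, M, T=300.0, dt=0.01):
    """Integrate from (x0, y0) up to time T with a uniform mutation target."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    c_x = np.full(len(x), 1.0 / len(x))
    c_y = np.full(len(y), 1.0 / len(y))
    for _ in range(int(round(T / dt))):
        x, y = rmd_step(x, y, A, B, M, c_x, c_y, dt)
    return x, y

def dist_to_centre(x, y):
    """Euclidean distance of the joint strategy from the uniform profile."""
    return float(np.linalg.norm(np.concatenate([x - 0.5, y - 0.5])))

# Matching Pennies: player 1 wants to match, player 2 wants to mismatch.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
B = -A

x_mut, y_mut = integrate([0.9, 0.1], [0.2, 0.8], A, B, M=0.05)  # with mutation
x_rep, y_rep = integrate([0.9, 0.1], [0.2, 0.8], A, B, M=0.0)   # pure replicator

dist_mut = dist_to_centre(x_mut, y_mut)
dist_rep = dist_to_centre(x_rep, y_rep)
print("with mutation:   ", x_mut, y_mut, dist_mut)
print("without mutation:", x_rep, y_rep, dist_rep)
```

Under these assumptions the mutation term adds a negative real part to the otherwise purely imaginary eigenvalues at the mixed equilibrium of Matching Pennies, so the perturbed trajectory spirals inwards while the unperturbed one keeps cycling. Comparing `dist_mut` and `dist_rep` shows the difference.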

Paper Structure (38 sections, 6 theorems, 43 equations, 45 figures, 1 table, 2 algorithms)

Introduction
Preliminaries
Finite normal-form games.
Nash equilibrium.
Repeated games, learning and rationality.
Replicator-mutator dynamics.
Mutation-bias learning
MBL with direct policy update (MBL-DPU).
MBL with logistic choice (MBL-LC).
Attracting mutation limits.
Perturbation creates a trade-off between accuracy and speed.
Experimental results
Prisoner's Dilemma (PD).
Zero-sum games: Matching Pennies (MP).
Zero-sum games: Rock-Paper-Scissors (RPS).
...and 23 more sections

Key Result

Proposition 3.1

For every time $T < \infty$, the family of stochastic processes $\{(X^\theta_{ih}(t))_{i,h}\}_{t \geq 0}$ induced by MBL-DPU converges to the replicator-mutator dynamics (eq:RMD) in the sense that for all $\varepsilon > 0$: where $n_\theta \theta \rightarrow T$ for $\theta \rightarrow 0$, $x(0)$ is a.s. the initial state of the stochastic processes and $\Phi(x(0), \cdot)$ is the unique solution of eq:RMD with $\Phi(x(0), 0) = x(0)$.
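The kind of statement made in Proposition 3.1, a stochastic process with step size $\theta$ staying close to an ODE solution at time $n_\theta \theta \approx T$, can be sketched numerically. The MBL-DPU update rule is not reproduced in this summary, so the code below uses a hypothetical stand-in: a Cross-learning-style update with an added mutation term, whose expected motion is a replicator-mutator ODE of the assumed form $x_h(f_h - x \cdot f) + M(c_h - x_h)$. The reward scaling to $[0, 1]$, the values of `M`, `theta` and `T`, and all function names are assumptions made for illustration only.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])  # player 1 is rewarded for matching
B = 1.0 - A                             # player 2 is rewarded for mismatching
M = 0.1                                 # ASSUMED mutation strength
C = np.array([0.5, 0.5])                # ASSUMED uniform mutation target

def ode_solution(x0, y0, T, dt=1e-3):
    """Euler approximation of the mean-field ODE of the update below."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(int(round(T / dt))):
        fx, fy = A @ y, B.T @ x
        x, y = (x + dt * (x * (fx - x @ fx) + M * (C - x)),
                y + dt * (y * (fy - y @ fy) + M * (C - y)))
    return x, y

def stochastic_run(x0, y0, T, theta, rng):
    """Hypothetical stand-in update (NOT the paper's MBL-DPU rule)."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    eye = np.eye(2)
    for _ in range(int(round(T / theta))):   # n_theta * theta is about T
        a = rng.choice(2, p=x)               # sample both players' actions
        b = rng.choice(2, p=y)
        r1, r2 = A[a, b], B[a, b]            # rewards in [0, 1]
        # Move towards the played action in proportion to the reward,
        # plus a small pull towards the mutation target C.
        x = x + theta * (r1 * (eye[a] - x) + M * (C - x))
        y = y + theta * (r2 * (eye[b] - y) + M * (C - y))
    return x, y

rng = np.random.default_rng(0)
x_ode, y_ode = ode_solution([0.9, 0.1], [0.2, 0.8], T=30.0)
x_sto, y_sto = stochastic_run([0.9, 0.1], [0.2, 0.8], T=30.0,
                              theta=1e-3, rng=rng)

# Largest componentwise gap between the stochastic state and the ODE state.
gap = max(np.abs(x_sto - x_ode).max(), np.abs(y_sto - y_ode).max())
print("ODE:       ", x_ode, y_ode)
print("stochastic:", x_sto, y_sto)
print("gap:", gap)
```

Because each update is a convex combination of the current strategy, the played action and the mutation target whenever $\theta (r + M) \leq 1$, the iterates remain valid probability vectors. Rerunning with a smaller `theta` should shrink `gap` on average, which is the qualitative behaviour the proposition describes.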

Figures (45)

Figure 1: Self-play on the MP game for 10 different initial conditions. Each subfigure shows the ten trajectories in the projection onto the first components of the players' strategies, in this case the 'defect' strategy, with the first player on the horizontal axis and the second on the vertical axis. Points coloured yellow correspond to earlier points in time, changing over orange and violet to black for later points in time. The position of the game's Nash equilibrium is marked with a blue cross in the projection plane.
Figure 2: Self-play of MBL-DPU on RPS-3, RPS-5 and RPS-9 games, with $M^{-1} = 20$.
Figure 3: Self-play of MBL-LC on RPS-3, RPS-5 and RPS-9 games, with $M^{-1} = \tau = 20$.
Figure 4: Self-play of FAQ-learning on RPS-3, RPS-5 and RPS-9 games, with $\tau = 20$.
Figure 5: Self-play of WoLF-PHC-learning on RPS-3, RPS-5 and RPS-9 games, with initial learning rate $10^{-1}$ for $Q$, win learning rate $1/2 \cdot 10^{-4}$.
...and 40 more figures

Theorems & Definitions (12)

Remark
Proposition 3.1
Remark
Proposition 3.2
Theorem A.1: Norman
Remark A.2
Proposition A.3
proof
Proposition A.4
proof
...and 2 more
