Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning

Zida Wu; Mathieu Lauriere; Samuel Jia Cong Chua; Matthieu Geist; Olivier Pietquin; Ankur Mehta

TL;DR

The paper proposes a deep reinforcement learning (DRL) algorithm, inspired by Munchausen RL and Online Mirror Descent, that achieves a population-dependent Nash equilibrium without averaging or sampling from history.

Abstract

Mean Field Games (MFGs) can handle large-scale multi-agent systems, but learning Nash equilibria in MFGs remains challenging. In this paper, we propose a deep reinforcement learning (DRL) algorithm, inspired by Munchausen RL and Online Mirror Descent, that achieves a population-dependent Nash equilibrium without averaging or sampling from history. With an additional inner-loop replay buffer, the agents can learn to reach a Nash equilibrium from any distribution while mitigating catastrophic forgetting. The resulting policy can be applied to various initial distributions. Numerical experiments on four canonical examples demonstrate that our algorithm has better convergence properties than state-of-the-art (SOTA) algorithms, in particular a DRL version of Fictitious Play for population-dependent policies.

Paper Structure (31 sections, 1 theorem, 24 equations, 17 figures, 2 tables, 4 algorithms)

Introduction
Background
MDP for a representative agent.
Classes of policies.
Best Response
Nash equilibrium.
Exploitability
Algorithm
Online Mirror Descent
Q-function update
Inner loop replay buffer
Experiments setup
Exploration
Example 1: Exploration in one room.
Example 2: Exploration in four connected rooms.
...and 16 more sections

Key Result

Theorem 3.1

Denote by $\boldsymbol{\pi}^{k-1}$ the softmax policy learned in iteration $k-1$, i.e., $\pi^{k-1}_n(\cdot|x,\mu) = \operatorname{softmax}(\frac{1}{\tau} \sum_{i=0}^{k-1} Q^i_n(x,\mu,\cdot))$, and by $Q^k$ the state-action value function in iteration $k$. Let $\widetilde{Q}^k=Q^k+\tau \ln \boldsymbo
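The softmax policy defined at the start of Theorem 3.1 can be illustrated with a minimal sketch. It assumes a single fixed time step $n$, state $x$, and distribution $\mu$, so each $Q^i_n(x,\mu,\cdot)$ reduces to a plain list of per-action values. The function names `softmax` and `omd_policy` and the numbers in `q_history` are hypothetical, chosen only for illustration; in the paper the Q-functions are learned by a neural network rather than stored as tables.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def omd_policy(q_history, tau):
    """Softmax of (1/tau) * sum_i Q^i over actions, at one fixed (n, x, mu).

    q_history: list of per-iteration Q-value lists (one float per action).
    tau: positive temperature.
    """
    n_actions = len(q_history[0])
    # Sum the Q-values of all past iterations, action by action.
    summed = [sum(q[a] for q in q_history) for a in range(n_actions)]
    return softmax([s / tau for s in summed])


# Hypothetical Q-values for 3 actions over iterations i = 0, 1, 2.
q_history = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
policy = omd_policy(q_history, tau=1.0)
```

Here the summed values are `[2.0, 1.0, 0.0]`, so the policy favors the first action. Keeping every past Q-function, as this sketch does, is the naive way to evaluate the formula; the TL;DR states that the proposed algorithm avoids averaging or sampling from history.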

Figures (17)

Figure 1: Example 1: Exploration in one room. (a): density evolution using the policy learnt by M-OMD, starting from the $\mu_0$ used for (b). (b): exploitability vs. training iteration for a single $\mu_0$. (c): average exploitability when training over 5 different $\mu_0$ (single run of each algorithm). (d): average over 5 runs with standard deviation.
Figure 2: Example 2: Exploration in four connected rooms. (a): density evolution using the policy learnt by M-OMD, starting from the $\mu_0$ used for (b). (b): exploitability vs. training iteration for a single $\mu_0$. (c): average exploitability when training over 5 different $\mu_0$ (single run of each algorithm). (d): average over 5 runs with standard deviation.
Figure 3: Exploitability vs. iteration number for various buffer sizes, using our M-OMD algorithm, in the exploration task with four connected rooms (see the Exploration section). Small buffer sizes lead to the forgetting of some $\mu_0$ and hence poor performance (see Step 2 in Algorithm 1).
Figure 4: Example 3: Beach bar problem. (a) and (b) show the distribution evolution for the 2D and 1D cases. (c), (d) and (e): exploitability for the 2D case when training (c) with a fixed $\mu_0$, (d) with different $\mu_0$, and (e) with different $\mu_0$ and averaging over 5 runs.
Figure 5: Example 4: Linear quadratic model. (a) shows the evolution of the population using the policy learned by the M-OMD algorithm, starting from two Gaussian distribution pairs, and then accumulating into the center of the population. (b) and (c) show the averaged exploitability obtained during training over one fixed initial distribution and five initial distributions, respectively.
...and 12 more figures

Theorems & Definitions (2)

Theorem 3.1
proof
