Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning

Zida Wu, Mathieu Lauriere, Samuel Jia Cong Chua, Matthieu Geist, Olivier Pietquin, Ankur Mehta

TL;DR

This paper proposes a deep reinforcement learning (DRL) algorithm, inspired by Munchausen RL and Online Mirror Descent, that achieves a population-dependent Nash equilibrium without averaging over or sampling from history.

Abstract

Mean Field Games (MFGs) can handle large-scale multi-agent systems, but learning Nash equilibria in MFGs remains challenging. In this paper, we propose a deep reinforcement learning (DRL) algorithm, inspired by Munchausen RL and Online Mirror Descent, that achieves a population-dependent Nash equilibrium without averaging over or sampling from history. Through an additional inner-loop replay buffer, the agents can effectively learn to reach a Nash equilibrium from any distribution, mitigating catastrophic forgetting. The resulting policy can be applied to various initial distributions. Numerical experiments on four canonical examples demonstrate that our algorithm has better convergence properties than state-of-the-art (SOTA) algorithms, in particular a DRL version of Fictitious Play for population-dependent policies.

Paper Structure

This paper contains 31 sections, 1 theorem, 24 equations, 17 figures, 2 tables, 4 algorithms.

Key Result

Theorem 3.1

Denote by $\boldsymbol{\pi}^{k-1}$ the softmax policy learned in iteration $k-1$, i.e., $\pi^{k-1}_n(\cdot|x,\mu) = \operatorname{softmax}\big(\frac{1}{\tau} \sum_{i=0}^{k-1} Q^i_n(x,\mu,\cdot)\big)$, and by $Q^k$ the state-action value function in iteration $k$. Let $\widetilde{Q}^k = Q^k + \tau \ln \ldots$
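
The softmax form of the policy in the statement above can be checked numerically. What follows is a minimal sketch in plain Python, not the authors' implementation: the helper names `softmax` and `omd_step`, the random Q-values, the number of actions, and the value of `tau` are all assumptions made for illustration. For one fixed $(n, x, \mu)$, it applies the standard KL-regularized mirror descent step $\pi^{k} \propto \pi^{k-1} \exp(Q^{k}/\tau)$ starting from a uniform policy and compares the result with the softmax of the summed Q-values.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def omd_step(prev_policy, q, tau):
    """One mirror descent step: new policy proportional to prev * exp(q / tau)."""
    return softmax([math.log(p) + qa / tau for p, qa in zip(prev_policy, q)])


random.seed(0)
tau = 0.5          # temperature (illustrative value)
num_actions = 4    # size of the action set (illustrative value)

# Hypothetical Q-values Q^0, ..., Q^5 for one fixed (n, x, mu).
q_values = [[random.gauss(0.0, 1.0) for _ in range(num_actions)]
            for _ in range(6)]

# Iterative form: start from the uniform policy and apply one step per Q^i.
policy = [1.0 / num_actions] * num_actions
for q in q_values:
    policy = omd_step(policy, q, tau)

# Closed form: softmax of the summed Q-values divided by tau.
summed_q = [sum(q[a] for q in q_values) for a in range(num_actions)]
closed_form = softmax([s / tau for s in summed_q])

# The two forms should agree up to floating-point error.
max_gap = max(abs(p - c) for p, c in zip(policy, closed_form))
print("max gap between iterative and closed form:", max_gap)
```

The agreement between the two forms is why a policy of this type can be represented from accumulated Q-values alone, without storing or averaging past policies.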

Figures (17)

  • Figure 1: Example 1: Exploration in one room. (a): density evolution using the policy learned by M-OMD, starting from the $\mu_0$ used for (b). (b): exploitability vs. training iteration for a single $\mu_0$. (c): average exploitability when training over 5 different $\mu_0$ (single run of each algorithm). (d): average over 5 runs, with standard deviation.
  • Figure 2: Example 2: Exploration in four connected rooms. (a): density evolution using the policy learned by M-OMD, starting from the $\mu_0$ used for (b). (b): exploitability vs. training iteration for a single $\mu_0$. (c): average exploitability when training over 5 different $\mu_0$ (single run of each algorithm). (d): average over 5 runs, with standard deviation.
  • Figure 3: Exploitability vs. iteration number for various buffer sizes, using our M-OMD algorithm in the four-connected-rooms exploration task (see the exploration experiment section). Small buffer sizes lead to the forgetting of some $\mu_0$ and hence poor performance (see Step 2 in Algorithm 1).
  • Figure 4: Example 3: Beach bar problem. (a) and (b): distribution evolution for the 2D and 1D cases. (c), (d), and (e): exploitability for the 2D case when training (c) with a fixed $\mu_0$, (d) with different $\mu_0$, and (e) with different $\mu_0$, averaged over 5 runs.
  • Figure 5: Example 4: Linear quadratic model. (a) shows the evolution of the population under the policy learned by the M-OMD algorithm, starting from two Gaussian distribution pairs and then accumulating at the center of the population. (b) and (c) show the averaged exploitability obtained during training over one fixed initial distribution and over five initial distributions, respectively.
  • ...and 12 more figures
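
The captions above report exploitability, which measures how much a single agent can gain by deviating from the policy that the rest of the population follows. A common definition in the MFG literature is given below as a sketch; it is not necessarily the paper's exact formula, and the symbols $J$, $\pi'$, and $\mu^{\pi}$ are notation assumed here for illustration:

$$\mathcal{E}(\pi) = \max_{\pi'} J(\pi'; \mu^{\pi}) - J(\pi; \mu^{\pi})$$

Here $\mu^{\pi}$ denotes the population flow induced when all agents follow $\pi$, and $J(\pi'; \mu)$ is the expected return of a single agent using $\pi'$ against the flow $\mu$. Under this definition, $\mathcal{E}(\pi) \ge 0$, with equality exactly when $\pi$ is a Nash equilibrium policy. The "average exploitability" in the panels for multiple $\mu_0$ is presumably this quantity averaged over the training initial distributions.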

Theorems & Definitions (2)

  • Theorem 3.1
  • proof