Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

Clarisse Wibault; Johannes Forkel; Sebastian Towers; Tiphaine Wibault; Juan Duque; George Whittle; Andreas Schaab; Yucheng Yang; Chiyuan Wang; Michael Osborne; Benjamin Moll; Jakob Foerster

TL;DR

The authors propose Recurrent Structural Policy Gradient (RSPG), the first history-aware Hybrid Structural Method (HSM) for settings involving public information, and introduce MFAX, their JAX-based framework for Mean Field Games (MFGs). RSPG achieves state-of-the-art performance and converges an order of magnitude faster.

Abstract

Mean Field Games (MFGs) provide a principled framework for modeling interactions in large-population models: at scale, population dynamics become deterministic, with uncertainty entering only through aggregate shocks, or common noise. However, algorithmic progress has been limited because model-free methods suffer from high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) combine Monte Carlo rollouts for the common noise with exact estimation of the expected return, conditioned on those samples. So far, however, HSMs have not been scaled to partially observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for settings involving public information. We also introduce MFAX, our JAX-based framework for MFGs. By leveraging known transition dynamics, RSPG achieves state-of-the-art performance and order-of-magnitude faster convergence, and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise and history-aware policies. MFAX is publicly available at: https://github.com/CWibault/mfax.
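The hybrid idea described in the abstract (sample the common noise, then propagate the population exactly given that sample) can be illustrated with a small tabular sketch. This is not the MFAX API: the function names, the toy three-state environment, and the two-valued shock are all hypothetical and chosen only for illustration, and the sketch covers the mean-field propagation step rather than the full policy-gradient estimator.

```python
import random


def analytic_update(mu, policy, transition):
    """Exact push-forward of the mean field under known dynamics:
    mu'[s2] = sum over s, a of mu[s] * policy[s][a] * transition[s][a][s2]."""
    nxt = [0.0] * len(mu)
    for s, mass in enumerate(mu):
        for a, p_a in enumerate(policy[s]):
            for s2, p_s2 in enumerate(transition[s][a]):
                nxt[s2] += mass * p_a * p_s2
    return nxt


def make_transition(n_states, p_success):
    """Hypothetical toy dynamics: action 0 stays put, action 1 moves one
    state to the right (wrapping around) with probability p_success."""
    T = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for s in range(n_states):
        T[s][0][s] = 1.0
        T[s][1][(s + 1) % n_states] += p_success
        T[s][1][s] += 1.0 - p_success
    return T


def hybrid_rollout(mu0, policy, transitions_by_shock, shock_probs, horizon, rng):
    """Sample the common noise by Monte Carlo; conditioned on each sampled
    shock, update the mean field exactly rather than by tracking agents."""
    mu, trajectory = list(mu0), [list(mu0)]
    for _ in range(horizon):
        z = rng.choices(range(len(shock_probs)), weights=shock_probs)[0]
        mu = analytic_update(mu, policy, transitions_by_shock[z])
        trajectory.append(mu)
    return trajectory


# Assumed setup: 3 states, uniform policy over 2 actions, 2 shock values.
policy = [[0.5, 0.5] for _ in range(3)]
transitions_by_shock = [make_transition(3, 1.0), make_transition(3, 0.5)]
trajectory = hybrid_rollout(
    [1.0, 0.0, 0.0], policy, transitions_by_shock, [0.5, 0.5], 4, random.Random(0)
)
```

The only randomness in the rollout comes from the shock sequence; given that sequence, every mean-field distribution in the trajectory is computed without sampling error.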

Paper Structure (54 sections, 40 equations, 11 figures, 9 tables, 4 algorithms)

Introduction
Preliminaries
Mean Field Games (MFGs) with Common Noise
Unified Taxonomy of DP, RL & HSMs for MFGs
Dynamic Programming (DP)
Reinforcement Learning (RL)
Related Work
Problem Setting
Algorithms
Partially Observable Mean Field Games with Common Noise
The Special Case of Shared Observations
Methods
Recurrent Structural Policy Gradient (RSPG)
Advantages & Limitations of HSMs and RL
MFAX
...and 39 more sections

Figures (11)

Figure 1: Top left: the analytic mean-field update computes the exact expectation over next states. Bottom left: a sample-based mean-field update re-approximates the mean-field at each step, by tracking individual agents. Right: network architecture for the reduced policy used in RSPG. The hidden state is independent of the individual state such that the analytic mean-field update has the same asymptotic computational cost as a memoryless policy. If the actions are continuous, an underlying continuous distribution is parameterised. The log-probabilities are evaluated at equal intervals along the action space, and used as logits for a categorical distribution. This structured prior induces ordinality in the action space.
Figure 2: Exploitability versus training wall-clock time for partially observable Linear Quadratic, Beach Bar, and Macroeconomics environments. All experiments were conducted on NVIDIA L40S GPUs (48 GB). HSMs (SPG, RSPG) are an order of magnitude faster than RL methods, with history-aware RSPG consistently achieving among the lowest exploitability. Shaded regions indicate 95%ile CI for the mean over 10 seeds.
Figure 3: Heatmaps: mean-field distribution (income on y-axis and wealth on x-axis) at specific timesteps during the episode for the Macroeconomics environment (with total episode length of 128 steps). Interest rates (first column) and wages (second column) are determined by the mean-field distribution. The environment is implemented with a finite horizon: with RSPG, agents learn anticipatory behaviour, spending more wealth just before the end of the episode, pushing interest rates up and wages down. This is not the case for SPG, which is memoryless.
Figure 4: Mean-field distribution (y-axis) versus time (x-axis) for the Beach Bar environment. Agents are rewarded for being next to the bar when it is open, and penalised for being directly next to the bar when it is closed, or just before it closes, which can occur halfway through the episode (white line). With RSPG and RIPPO, agents learn to anticipate the time at which the bar might close, moving away from the bar just before potential closure, and back towards it if it stays open (as above).
Figure 5: Mean-field distribution (y-axis) versus time (x-axis) for the Beach Bar environment (top). Learned policy (middle) versus best response policy (bottom). Agents are rewarded for being next to the bar when it is open, and penalised for being directly next to the bar when it is closed, or just before it closes, which can occur halfway through the episode (white line). Here the bar stays open, which is why agents move back towards the bar.
...and 6 more figures
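
The Figure 1 caption describes turning a continuous action distribution into a categorical one by evaluating log-probabilities on an evenly spaced grid and using them as logits. A minimal sketch of that step follows, assuming a Gaussian as the underlying distribution (the caption does not say which family is used); the function names and parameters are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of a normal distribution (assumed underlying family)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def action_grid(low, high, n_bins):
    """Equally spaced action values covering [low, high]."""
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discretised_policy(mean, std, low, high, n_bins):
    """Use the continuous log-probabilities at the grid points as logits
    for a categorical distribution over the discretised actions."""
    grid = action_grid(low, high, n_bins)
    logits = [gaussian_log_prob(a, mean, std) for a in grid]
    return grid, softmax(logits)


grid, probs = discretised_policy(mean=0.0, std=0.5, low=-1.0, high=1.0, n_bins=5)
```

Because neighbouring grid points receive similar logits, probability mass varies smoothly across adjacent actions, which is one way to read the caption's remark that the structured prior induces ordinality in the action space.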

Theorems & Definitions (3)

Definition 2.1
Definition 2.2
Definition 5.1
