Mean-Field Reinforcement Learning without Synchrony

Shan Yang

TL;DR

The Temporal Mean Field framework is constructed from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory.

Abstract

Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent's dependence on others to a single summary statistic -- the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic -- one that remains defined regardless of which agents act. The population distribution $μ\in Δ(\mathcal{O})$ -- the fraction of agents at each observation -- satisfies this requirement: its dimension is independent of $N$, and under exchangeability it fully determines each agent's reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to $μ$. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution $μ$ from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an $O(1/\sqrt{N})$ finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all $N$ act per step, with approximation error decaying at the predicted $O(1/\sqrt{N})$ rate.
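To make the contrast between the two summary statistics concrete, here is a minimal sketch, not taken from the paper. It assumes integer-coded observations, scalar actions, and idle agents marked with `NaN`; the function names `mean_action` and `population_distribution` are hypothetical. It only illustrates that an average over all agents' actions breaks when some agents do not act, while the empirical distribution over observations stays well defined.

```python
import numpy as np


def mean_action(actions):
    """Average action over ALL agents; NaN as soon as any agent is idle."""
    # Idle agents are encoded as NaN (an assumption of this sketch),
    # so the plain mean propagates NaN when anyone did not act.
    return float(np.mean(actions))


def population_distribution(observations, num_obs):
    """Empirical distribution mu: fraction of agents at each observation."""
    counts = np.bincount(observations, minlength=num_obs)
    return counts / observations.size


# Four agents; agents 1 and 3 are idle this step (hypothetical data).
observations = np.array([0, 2, 1, 2])
actions = np.array([1.0, np.nan, 0.0, np.nan])

m = mean_action(actions)                       # undefined (NaN) here
mu = population_distribution(observations, 3)  # still a valid distribution
```

The length of `mu` depends only on the number of observations, not on the number of agents, which matches the abstract's point that the dimension of the population distribution is independent of $N$.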

Paper Structure (48 sections, 4 theorems, 85 equations, 5 figures, 1 algorithm)


Introduction
Preliminaries
Markov Games and Mean-Field RL
Synchrony as a Structural Requirement
The Population Distribution as Mean Field
The Temporal Mean Field Framework
TMF Dynamic
TMF Bellman Equation
TMF Equilibrium
Theoretical Guarantees for TMF
Assumptions
Existence and Uniqueness of the TMF Equilibrium
Finite-Population Approximation
TMF Reinforcement Learning
Convergence.
...and 33 more sections

Key Result

Theorem 4.4

Under the paper's assumptions (Lipschitz continuity through monotonicity), if the monotonicity constant satisfies […], where $L_V = (L_r + \gamma L_P R_{\max}/(1-\gamma))/(1-\gamma)$ is the value sensitivity constant (Lemma 2.1), then for any batch size $B \in \{1, \ldots, N\}$, a TMF equilibrium $(\pi^*, \{\mu_t^*\})$ exists and is unique.
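As a worked illustration of the value sensitivity constant, the sketch below evaluates the formula for $L_V$ exactly as written in the theorem statement. The helper name `value_sensitivity` and the numeric inputs are hypothetical, and the threshold that the monotonicity constant must satisfy is not reproduced because it is not given here.

```python
def value_sensitivity(L_r, L_P, R_max, gamma):
    """L_V = (L_r + gamma * L_P * R_max / (1 - gamma)) / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return (L_r + gamma * L_P * R_max / (1.0 - gamma)) / (1.0 - gamma)


# Hypothetical constants, chosen only to make the arithmetic easy to follow:
# (1 + 0.5 * 0.5 * 1 / 0.5) / 0.5 = (1 + 0.5) / 0.5
L_V = value_sensitivity(L_r=1.0, L_P=0.5, R_max=1.0, gamma=0.5)
```

Because $(1-\gamma)$ appears twice in the denominator, $L_V$ grows roughly like $1/(1-\gamma)^2$ as $\gamma \to 1$ whenever $L_P R_{\max} > 0$.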

Figures (5)

Figure 1: SRSG ($N = 100$): per-agent welfare vs. batch size $B$. TMF-PG achieves above $70\%$ of the congestion-free maximum at every $B$; Myopic matches at $B = 1$ but collapses to $0.5$ when all agents choose simultaneously.
Figure 2: SRSG ($B = 1$): both welfare standard deviation (a) and trajectory prediction error (b) decay as $O(1/\sqrt{N})$ (dashed reference line), consistent with Theorem 4.5.
Figure 3: DQG: per-agent reward vs. $N$. TMF-PG avoids the cliff penalty on the honey-trap server by anticipating collective load, yielding $10$--$30\%$ higher reward than Myopic.
Figure 4: SRSG: $L_1$ forward prediction error across the $N \times B$ grid (40 seeds). For each $N$, error is nearly constant across $B$; for each $B$, error decays as $O(1/\sqrt{N})$ (dashed reference line).
Figure 5: DQG sensitivity analysis ($N = 50$). TMF-PG consistently outperforms Myopic across all tested cliff intensities and horizons.
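
The $O(1/\sqrt{N})$ decay reported in Figures 2 and 4 has the same shape as the sampling error of an empirical distribution. The sketch below is not the paper's experiment: it assumes $N$ agents placed independently according to a fixed, hypothetical distribution `mu`, and estimates the mean $L_1$ distance between the empirical and true distributions. The function name `mean_l1_error` and all constants are illustrative.

```python
import numpy as np


def mean_l1_error(mu, n_agents, n_trials, rng):
    """Monte Carlo estimate of E || mu_hat_N - mu ||_1 for i.i.d. agents."""
    # Each row holds the observation counts for one draw of n_agents agents.
    counts = rng.multinomial(n_agents, mu, size=n_trials)
    empirical = counts / n_agents
    return float(np.abs(empirical - mu).sum(axis=1).mean())


rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.2])   # hypothetical true distribution
sizes = [100, 400, 1600]         # each step multiplies N by 4
errors = [mean_l1_error(mu, n, 2000, rng) for n in sizes]

# Under a 1/sqrt(N) rate, error * sqrt(N) should stay roughly constant.
scaled = [e * np.sqrt(n) for e, n in zip(errors, sizes)]
```

If the rate holds, quadrupling `n_agents` should roughly halve the error, so the entries of `scaled` stay close to one another.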

Theorems & Definitions (11)

Definition 2.1: Decision protocol
Definition 3.1: TMF Dynamic
Remark 3.2
Definition 3.3: TMF Equilibrium
Theorem 4.4: Existence and uniqueness
Theorem 4.5: $N$-agent approximation
Remark 4.6: Batch size and passive coupling
Theorem 5.1: Convergence of TMF-PG
Lemma 2.1: Value sensitivity
proof
...and 1 more
