Mean-Field Reinforcement Learning without Synchrony

Shan Yang

TL;DR

The Temporal Mean Field framework is constructed from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory.

Abstract

Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent's dependence on others to a single summary statistic -- the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic -- one that remains defined regardless of which agents act. The population distribution $\mu \in \Delta(\mathcal{O})$ -- the fraction of agents at each observation -- satisfies this requirement: its dimension is independent of $N$, and under exchangeability it fully determines each agent's reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to $\mu$. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution $\mu$ from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an $O(1/\sqrt{N})$ finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all $N$ act per step, with approximation error decaying at the predicted $O(1/\sqrt{N})$ rate.
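The contrast between the two summary statistics can be illustrated with a minimal sketch. It assumes a finite observation space indexed $0, \ldots, K-1$ and represents an idle agent's action as `None`; the function names `population_distribution` and `mean_action` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def population_distribution(observations: Sequence[int], num_obs: int) -> list[float]:
    """Fraction of agents at each observation (assumed finite, indexed 0..num_obs-1).

    The output has length num_obs, independent of the number of agents N.
    """
    counts = Counter(observations)
    n = len(observations)
    return [counts.get(o, 0) / n for o in range(num_obs)]


def mean_action(actions: Sequence[Optional[float]]) -> Optional[float]:
    """Mean over every agent's action; returns None (undefined) if any agent is idle."""
    if any(a is None for a in actions):
        return None
    return sum(actions) / len(actions)


# N = 8 agents, 3 possible observations; only two agents act this step.
observations = [0, 2, 2, 1, 2, 0, 1, 2]
actions = [1.0, None, None, 0.0, None, None, None, None]

mu = population_distribution(observations, num_obs=3)
print(mu)                    # [0.25, 0.25, 0.5]
print(mean_action(actions))  # None
```

In this toy setting $\mu$ is computed from where agents are, not from what they do, so it stays defined whether one agent or all of them act in a given step.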

Paper Structure (48 sections, 4 theorems, 85 equations, 5 figures, 1 algorithm)

Key Result

Theorem 4.4

Under Assumptions asmp:lipschitz--asmp:monotonicity, if the monotonicity constant satisfies [...], where $L_V = \bigl(L_r + \gamma L_P R_{\max}/(1-\gamma)\bigr)/(1-\gamma)$ is the value sensitivity constant (Lemma 2.1), then for any batch size $B \in \{1, \ldots, N\}$, a TMF equilibrium $(\pi^*, \{\mu_t^*\})$ exists and is unique.
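As a worked example of the value sensitivity constant $L_V$ in Theorem 4.4, the sketch below evaluates the stated formula numerically. The reading of $L_r$ and $L_P$ as Lipschitz constants of the reward and transition, and of $R_{\max}$ as a reward bound, is an assumption based on the notation. The function name `value_sensitivity` and the input values are purely illustrative.

```python
def value_sensitivity(L_r: float, L_P: float, R_max: float, gamma: float) -> float:
    """Evaluate L_V = (L_r + gamma * L_P * R_max / (1 - gamma)) / (1 - gamma).

    Assumed meanings (not confirmed by the source): L_r and L_P are Lipschitz
    constants of the reward and transition, R_max bounds the reward, and
    gamma is the discount factor in [0, 1).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return (L_r + gamma * L_P * R_max / (1.0 - gamma)) / (1.0 - gamma)


# Illustrative inputs only; they do not come from the paper.
L_V = value_sensitivity(L_r=1.0, L_P=0.5, R_max=1.0, gamma=0.5)
print(L_V)  # 3.0
```

Because $(1-\gamma)$ appears twice in the denominator, $L_V$ grows roughly like $1/(1-\gamma)^2$ as $\gamma \to 1$, so any condition stated in terms of $L_V$ becomes more demanding for long effective horizons.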

Figures (5)

  • Figure 1: SRSG ($N = 100$): per-agent welfare vs. batch size $B$. TMF-PG achieves above $70\%$ of the congestion-free maximum at every $B$; Myopic matches at $B = 1$ but collapses to $0.5$ when all agents choose simultaneously.
  • Figure 2: SRSG ($B = 1$): both welfare standard deviation (a) and trajectory prediction error (b) decay as $O(1/\sqrt{N})$ (dashed reference line), consistent with Theorem 4.5.
  • Figure 3: DQG: per-agent reward vs. $N$. TMF-PG avoids the cliff penalty on the honey-trap server by anticipating collective load, yielding $10$--$30\%$ higher reward than Myopic.
  • Figure 4: SRSG: $L_1$ forward prediction error across the $N \times B$ grid (40 seeds). For each $N$, error is nearly constant across $B$; for each $B$, error decays as $O(1/\sqrt{N})$ (dashed reference line).
  • Figure 5: DQG sensitivity analysis ($N = 50$). TMF-PG consistently outperforms Myopic across all tested cliff intensities and horizons.
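
The $O(1/\sqrt{N})$ reference line in Figures 2 and 4 can be given some intuition with a small Monte Carlo sketch. It is not the paper's experiment: it assumes $N$ agents draw observations i.i.d. from a fixed distribution (an arbitrary choice here) and measures the $L_1$ distance between the empirical population distribution and the true one. The helper names are hypothetical.

```python
import random
from math import sqrt


def l1_error(n_agents: int, probs: list[float], rng: random.Random) -> float:
    """L1 distance between the empirical distribution of n_agents i.i.d. draws and probs."""
    k = len(probs)
    samples = rng.choices(range(k), weights=probs, k=n_agents)
    counts = [0] * k
    for s in samples:
        counts[s] += 1
    return sum(abs(c / n_agents - p) for c, p in zip(counts, probs))


def mean_l1_error(n_agents: int, probs: list[float], trials: int = 400, seed: int = 0) -> float:
    """Average the L1 error over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(l1_error(n_agents, probs, rng) for _ in range(trials)) / trials


probs = [0.5, 0.3, 0.2]  # arbitrary illustrative distribution over 3 observations
for n in (25, 100, 400, 1600):
    err = mean_l1_error(n, probs)
    # If the error scales as 1/sqrt(N), err * sqrt(N) should stay roughly constant.
    print(n, round(err, 4), round(err * sqrt(n), 3))
```

Under this i.i.d. assumption, quadrupling $N$ should roughly halve the error, which is the same slope as the dashed reference lines described in the captions.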

Theorems & Definitions (11)

  • Definition 2.1: Decision protocol
  • Definition 3.1: TMF Dynamic
  • Remark 3.2
  • Definition 3.3: TMF Equilibrium
  • Theorem 4.4: Existence and uniqueness
  • Theorem 4.5: $N$-agent approximation
  • Remark 4.6: Batch size and passive coupling
  • Theorem 5.1: Convergence of TMF-PG
  • Lemma 2.1: Value sensitivity
  • proof
  • ...and 1 more