Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games

Antonio Ocello, Daniil Tiapkin, Lorenzo Mancini, Mathieu Laurière, Eric Moulines

TL;DR

The paper addresses computing mean-field Nash equilibria in ergodic MF-MDPs using reinforcement learning, introducing MF-TRPO. It develops two algorithms, ExactMFTRPO with non-asymptotic, finite-sample guarantees and SampleBasedMFTRPO with high-probability finite-sample guarantees, achieving a total environment interaction complexity of $\tilde{O}(1/\varepsilon^6)$ for the model-free variant. Theoretical results include a $\tilde{O}(1/L)$ convergence rate for the exact method and a controlled exploitability bound for the model-free version, together with rigorous concentration-based analysis. Empirical results on grid-based crowd modeling corroborate the theoretical findings, demonstrating stable convergence and meaningful mean-field distribution dynamics across scenarios.

Abstract

We introduce Mean-Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
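For readers unfamiliar with the two ingredients the abstract combines, the following is a minimal sketch, not the paper's algorithm. It assumes a tabular setting and shows (i) the standard closed-form KL-proximal (mirror-descent) policy step that tabular TRPO analyses typically use, and (ii) a one-step push-forward of a state distribution under a fixed policy. The function names, the `step_size` parameter, and the nested-list layout of `transition` are illustrative assumptions; the exact MF-TRPO update, including any regularization terms, is not reproduced here.

```python
import math


def kl_proximal_policy_step(policy, q_values, step_size):
    """One KL-proximal policy step (illustrative, not the paper's exact update).

    policy[s][a]   : current probability of action a in state s (assumed > 0)
    q_values[s][a] : action-value estimate for (s, a)
    Returns a new policy with new[s][a] proportional to
    policy[s][a] * exp(step_size * q_values[s][a]).
    """
    new_policy = []
    for probs, qs in zip(policy, q_values):
        logits = [math.log(p) + step_size * q for p, q in zip(probs, qs)]
        shift = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - shift) for l in logits]
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy


def mean_field_step(mu, policy, transition):
    """Push the state distribution mu one step forward under `policy`.

    transition[s][a][s_next] is the probability of moving to s_next
    from state s under action a (hypothetical layout).
    """
    next_mu = [0.0] * len(mu)
    for s, mass in enumerate(mu):
        for a, action_prob in enumerate(policy[s]):
            for s_next, p in enumerate(transition[s][a]):
                next_mu[s_next] += mass * action_prob * p
    return next_mu


# Usage: two states, two actions, uniform initial policy.
policy = [[0.5, 0.5], [0.5, 0.5]]
q_values = [[1.0, 0.0], [0.0, 0.0]]
updated = kl_proximal_policy_step(policy, q_values, step_size=1.0)
```

In state 0 the update shifts probability toward the action with the larger value estimate, while in state 1, where both estimates are equal, the policy is unchanged. A smaller `step_size` keeps the new policy closer to the old one in KL divergence, which is the trust-region idea the abstract refers to.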

Paper Structure

This paper contains 48 sections, 22 theorems, 161 equations, 14 figures, 1 table, 8 algorithms.

Key Result

Proposition 3.1

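Suppose the Lipschitz continuity assumption holds. Then there exists a constant $C_{\pi,\mu}\geq0$ such that, for all $\mu,\mu^\prime\in\mathcal{P}(\mathcal{S})$, the distance between the optimal policies $\pi_{\mu}$ and $\pi_{\mu^\prime}$ is at most $C_{\pi,\mu}$ times the distance between $\mu$ and $\mu^\prime$, where $\pi_{\mu}$ is the optimal policy associated with the mean-field distribution $\mu$.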

Figures (14)

  • Figure 1: Exploitability achieved by the SampleBasedMFTRPO algorithm in the $5 \times 5$ Grid-Based Crowd Modeling game with the bottom-right corner being a point of interest. The left plot corresponds to $\eta = 0.05$, and the right to $\eta = 0.3$ with results averaged over $10$ and $3$ random seeds, respectively.
  • Figure 2: Evolution of the mean field distribution for $\eta = 0.05$ in the $5 \times 5$ Grid-Based Crowd Modeling game with the bottom-right corner being a point of interest. From left to right: step 0, step 10 and step 200.
  • Figure 3: The reading order is (from left to right): Four Rooms Crowd Modeling, Two-Islands-Graph Crowd Modeling, and Four Rooms Crowd Modeling with a point of interest. Solid lines denote $\eta = 0.05$, whereas dashed lines indicate $\eta = 0.3$.
  • Figure 4: Evolution of the mean field distribution for $\eta = 0.05$ in the Four Rooms Crowd Modeling game. From left to right: step 0, step 1000 and step 5000.
  • Figure 5: Evolution of the mean field distribution for $\eta = 0.05$ in the Two-Islands Graph Crowd Modeling game. From left to right: step 0, step 2000 and step 5000.
  • ...and 9 more figures

Theorems & Definitions (43)

  • Definition 2.1: MFNE
  • Definition 2.2
  • Proposition 3.1
  • Proposition 4.1
  • Proposition 4.2: informal
  • Corollary 4.3
  • Proposition 5.1
  • Proposition 5.2: informal
  • Corollary 5.3
  • Theorem 3.1: Theorem 16 in Shani et al. (2020)
  • ...and 33 more