Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling

Emile Anand; Ishani Karmarkar

TL;DR

This work studies a cooperative Markov game with a global agent and homogeneous local agents in a communication-constrained regime, and proposes an alternating learning framework, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP.

Abstract

Many large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime, where the global agent only observes a subset of $k$ local agent states per time step. We propose an alternating learning framework $(\texttt{ALTERNATING-MARL})$, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium, while yielding a separation in the sample complexities between the joint state space and action space. Finally, we validate our results in numerical simulations for multi-robot control and federated optimization.
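The subsampling step described in the abstract can be illustrated with a minimal sketch: the global agent sees only $k$ of the $n$ local states and summarizes them by their empirical distribution. This is not the paper's algorithm, only the observation model behind it. The function names, the five-element local state space, and the uniform initial states are assumptions made for illustration.

```python
import random
from collections import Counter

def empirical_distribution(states, num_states):
    """Fraction of agents in each local state (a point on the simplex)."""
    counts = Counter(states)
    return [counts.get(s, 0) / len(states) for s in range(num_states)]

def subsampled_distribution(states, k, num_states, rng):
    """Empirical distribution of k agents drawn uniformly without replacement."""
    return empirical_distribution(rng.sample(states, k), num_states)

def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical setup: n agents, each in one of 5 local states.
rng = random.Random(0)
n, num_states = 1000, 5
local_states = [rng.randrange(num_states) for _ in range(n)]
full = empirical_distribution(local_states, num_states)

# Compare the k-agent view with the full population for several budgets k.
for k in (1, 10, 100, n):
    sub = subsampled_distribution(local_states, k, num_states, rng)
    print(k, round(total_variation(full, sub), 3))
```

With $k = n$ the subsample is a permutation of the full population, so the two distributions coincide. For smaller $k$ the gap is random, and the paper's $\widetilde{O}(1/\sqrt{k})$ guarantee reflects how this kind of estimation error shrinks as the budget grows.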

Paper Structure (32 sections, 54 theorems, 202 equations, 4 figures, 2 tables, 4 algorithms)


Introduction
Related Work
Preliminaries
Algorithmic Approach: Alternating MARL
Overview of Approach
Algorithm Description
Theoretical Guarantees and Analysis
Convergence Guarantees on L-LEARN
Convergence Guarantees on ALTERNATING-MARL
Conclusion and Future Work
Mathematical Background and Additional Remarks
Optimality of the Global Agent Policy
Proof of Lipschitz-Continuity Bound
Bounding the Total Variation Distance
Using the Performance Difference Lemma to Bound the Optimality Gap
...and 17 more sections

Key Result

Theorem 4.2

For all $s\in\mathcal{S}_g\times\mathcal{S}_l^n$, if $T\geq\frac{2}{1-\gamma}\log\frac{\tilde{r}\sqrt{k}}{1-\gamma}$ in G-LEARN, then

Figures (4)

Figure 1: Robot Coordination. This figure illustrates how our communication-constrained framework can be applied for decentralized coordination of large teams of robots. We note that generative AI was used to refine aesthetics of this figure.
Figure 2: Decision rule for consecutive best-response iterates using the progress statistic $\Delta_k = V^{\pi_g',\pi_\ell'} - V^{\pi_g,\pi_\ell}$. The middle region $|\Delta_k|\le2\eta$ triggers stopping and provides a $2\eta$-approximate Nash certificate.
Figure 3: (a) Discounted cumulative rewards for $k < n = 1000$. As $k$ increases, the rewards generally increase and converge. (b) As $k$ increases, the runtime required to converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium grows sharply. For some values of $k$, the runtime is significantly shorter because ALTERNATING-MARL learns a policy in fewer than $N_{\text{steps}}$ iterations and terminates early.
Figure 4: Simulation of the multi-agent warehouse system for subsampling budgets $k = 1, 10, 20, 35$. The heatmap shows the fraction of $n=1000$ robots occupying each of the $5$ zones at each timestep, where darker shades imply a higher concentration. The solid blue line indicates the zone chosen by the dispatcher (global agent), while the dashed black line shows the true population mode (the zone with the most robots). At $k = 1$, the dispatcher's choices differ significantly from the true mode, while at $k = 35$, the dispatcher tracks the mode substantially better, steering resources to the correct zone more than twice as often. The initial agent positions are drawn from a $\mathsf{Dirichlet}(0.3)$ distribution, creating a concentrated starting configuration, and all panels share the same random seed for comparability.
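The stopping rule from the Figure 2 caption can be sketched as a small function. Only the middle region $|\Delta_k|\le 2\eta$ and its "stop" outcome come from the caption. The function name and the labels for the two outer regions are assumptions made for illustration, and the value estimates are taken as given rather than computed.

```python
def best_response_decision(v_new, v_old, eta):
    """Classify a pair of consecutive best-response iterates.

    v_new, v_old: value estimates for the updated and current policy pairs.
    eta: tolerance; |delta| <= 2 * eta is the stopping region in Figure 2.
    """
    delta = v_new - v_old  # progress statistic Delta_k
    if abs(delta) <= 2 * eta:
        return "stop"  # 2*eta-approximate Nash certificate (per the caption)
    # Labels below are hypothetical; the caption does not name these regions.
    return "improved" if delta > 0 else "degraded"

print(best_response_decision(1.00, 0.95, eta=0.05))
print(best_response_decision(2.00, 1.00, eta=0.10))
```

Keeping the rule as a pure function of two value estimates and a tolerance makes it easy to call once per alternation round, whatever procedure produces the estimates.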

Theorems & Definitions (113)

Definition 2.3: Nash Equilibrium
Definition 2.4: $\epsilon$-approximate Nash Equilibrium policy
Example 2.5: Communication-constrained control
Example 2.6: Federated optimization with partial participation
Definition 3.1: Empirical distribution function $F_{s_\Delta}$
Definition 4.1: Bellman noise
Theorem 4.2
Lemma 4.3: Controlling the Bellman noise
Theorem 4.4: Global agent subsampling result
Remark 4.5
...and 103 more
