Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling

Emile Anand, Ishani Karmarkar

TL;DR

This work studies a cooperative Markov game with a global agent and homogeneous local agents in a communication-constrained regime, and proposes an alternating learning framework, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP.

Abstract

Many large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime, where the global agent only observes a subset of $k$ local agent states per time step. We propose an alternating learning framework $(\texttt{ALTERNATING-MARL})$, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium, while yielding a separation in the sample complexities between the joint state space and action space. Finally, we validate our results in numerical simulations for multi-robot control and federated optimization.
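To make the subsampling idea concrete, here is a minimal Python sketch; it is not the authors' implementation. It draws $k$ of the $n$ local agent states and forms their empirical distribution, the kind of mean-field summary a subsampled global agent could condition on. The function name `subsampled_empirical_distribution`, the integer state encoding, and uniform sampling without replacement are all assumptions made for illustration.

```python
import random
from collections import Counter


def subsampled_empirical_distribution(local_states, k, num_states, rng):
    """Empirical distribution of k local states sampled without replacement.

    Assumption: local states are integers in range(num_states).
    """
    sampled = rng.sample(local_states, k)  # k distinct agents, uniformly at random
    counts = Counter(sampled)
    return [counts[s] / k for s in range(num_states)]


rng = random.Random(0)
n, num_states = 1000, 5  # hypothetical population size and local state count
local_states = [rng.randrange(num_states) for _ in range(n)]

# With k = n every agent is observed, so this is the full empirical distribution.
full = subsampled_empirical_distribution(local_states, n, num_states, rng)
for k in (10, 100, 1000):
    approx = subsampled_empirical_distribution(local_states, k, num_states, rng)
    # Total variation distance between the subsampled and full distributions.
    tv = 0.5 * sum(abs(a - b) for a, b in zip(approx, full))
    print(k, round(tv, 3))
```

The gap between the subsampled and full distributions typically shrinks as $k$ grows, which is the intuition behind an error rate on the order of $1/\sqrt{k}$.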

Paper Structure (32 sections, 54 theorems, 202 equations, 4 figures, 2 tables, 4 algorithms)

Key Result

Theorem 4.2

For all $s\in\mathcal{S}_g\times\mathcal{S}_l^n$, if $T\geq\frac{2}{1-\gamma}\log\frac{\tilde{r}\sqrt{k}}{1-\gamma}$ in G-LEARN, then
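As a worked illustration of the horizon condition in Theorem 4.2 only, the following Python sketch evaluates the lower bound on $T$ numerically. The function name `min_horizon` and the sample values of $\gamma$, $\tilde{r}$, and $k$ are assumptions for illustration; $\tilde{r}$ is treated as a given positive constant and the logarithm is assumed to be natural.

```python
import math


def min_horizon(gamma, r_tilde, k):
    """Smallest non-negative integer T with
    T >= 2 / (1 - gamma) * log(r_tilde * sqrt(k) / (1 - gamma)).

    Assumption: natural logarithm; r_tilde is a positive constant.
    """
    bound = 2.0 / (1.0 - gamma) * math.log(r_tilde * math.sqrt(k) / (1.0 - gamma))
    return max(0, math.ceil(bound))


# Hypothetical parameter values, chosen only to show how the bound scales with k.
for k in (1, 10, 100):
    print(k, min_horizon(gamma=0.9, r_tilde=1.0, k=k))
```

Because $k$ enters only through $\log\sqrt{k}$, the required horizon grows logarithmically in the subsampling budget, while the $1/(1-\gamma)$ factor dominates as $\gamma$ approaches 1.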

Figures (4)

  • Figure 1: Robot Coordination. This figure illustrates how our communication-constrained framework can be applied to the decentralized coordination of large teams of robots. We note that generative AI was used to refine the aesthetics of this figure.
  • Figure 2: Decision rule for consecutive best-response iterates using the progress statistic $\Delta_k = V^{\pi_g',\pi_\ell'} - V^{\pi_g,\pi_\ell}$. The middle region $|\Delta_k|\le2\eta$ triggers stopping and provides a $2\eta$-approximate Nash certificate.
  • Figure 3: a) Discounted cumulative rewards for $k < n = 1000$. As $k$ increases, the rewards generally increase and converge. b) As $k$ increases, the runtime required to converge to a $\tilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium grows sharply. For some values of $k$, the runtime is significantly shorter because ALTERNATING-MARL learns a policy in fewer than $N_{\text{steps}}$ iterations and terminates early.
  • Figure 4: Simulation of the multi-agent warehouse system for subsampling budgets $k = 1, 10, 20, 35$. The heatmap shows the fraction of $n=1000$ robots occupying each of the $5$ zones at each timestep, where darker shades indicate a higher concentration. The solid blue line indicates the zone chosen by the dispatcher (global agent), while the dashed black line shows the true population mode (the zone with the most robots). At $k = 1$, the dispatcher's choices differ significantly from the true mode, while at $k = 35$, the dispatcher tracks the mode substantially better, steering resources to the correct zone more than twice as often. The initial agent positions are drawn from a $\mathsf{Dirichlet}(0.3)$ distribution, creating a concentrated starting configuration, and all panels share the same random seed for comparability.
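
The stopping test described in the Figure 2 caption can be sketched in Python as follows. This is a hedged illustration rather than the paper's algorithm: the function name `should_stop` and the numeric value estimates are assumptions, and only the middle-region rule $|\Delta_k|\le 2\eta$ stated in the caption is implemented. What happens in the other regions is not specified in this summary and is left out.

```python
def should_stop(value_new, value_old, eta):
    """Return True when |Delta_k| <= 2 * eta, with Delta_k = value_new - value_old.

    value_new stands for V^{pi_g', pi_l'} and value_old for V^{pi_g, pi_l};
    how these values are estimated is outside the scope of this sketch.
    """
    delta_k = value_new - value_old
    return abs(delta_k) <= 2.0 * eta


# Hypothetical value estimates for consecutive best-response iterates.
print(should_stop(value_new=10.05, value_old=10.00, eta=0.1))  # small change
print(should_stop(value_new=11.00, value_old=10.00, eta=0.1))  # large change
```

When the test returns True, the caption's reading is that the current iterate carries a $2\eta$-approximate Nash certificate.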

Theorems & Definitions (113)

  • Definition 2.3: Nash Equilibrium
  • Definition 2.4: $\epsilon$-approximate Nash Equilibrium policy
  • Example 2.5: Communication-constrained control
  • Example 2.6: Federated optimization with partial participation
  • Definition 3.1: Empirical distribution function $F_{s_\Delta}$
  • Definition 4.1: Bellman noise
  • Theorem 4.2
  • Lemma 4.3: Controlling the Bellman noise
  • Theorem 4.4: Global agent subsampling result
  • Remark 4.5
  • ...and 103 more