Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Emile Anand, Ishani Karmarkar, Guannan Qu

TL;DR

This work tackles the curse of dimensionality in MARL by introducing SUBSAMPLE-MFQ, which learns policies for a cooperative system with one global agent and $n$ local agents by subsampling $k \leq n$ of the local agents. It performs mean-field Q-learning on the $k$-agent surrogate and executes a randomized policy that samples subsets at runtime; with high probability, the resulting policy’s optimality gap decays at a rate of $\tilde{O}(1/\sqrt{k})$, independent of $n$. The key analytical ingredients are a Lipschitz bound in total variation, a sampling-without-replacement concentration bound for the empirical distribution, and an adapted performance-difference lemma, which together enable a polynomial-in-$k$ learning procedure with scalable guarantees. The approach yields exponential speedups in $n$ when $k=O(\log n)$ and extends to off-policy and linear-MDP-like non-tabular settings, which suggests practical scalability gains for cooperative MARL and potential CTDE-type deployments. Overall, the paper contributes a subsampling framework with theoretical guarantees that preserves near-optimality while substantially reducing sample and computational complexity in large-scale multi-agent systems.
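The sampling-without-replacement ingredient can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm: the population size, the number of local states, the uniform state assignment, and all function names below are assumptions chosen for illustration. It draws $k$ of the $n$ local agents without replacement and measures the total variation distance between the subsample's empirical state distribution and that of the full population.

```python
import random
from collections import Counter

def empirical_distribution(states, num_local_states):
    """Fraction of agents in each local state (states are ints in 0..num_local_states-1)."""
    counts = Counter(states)
    return [counts[s] / len(states) for s in range(num_local_states)]

def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical population: n local agents, each in one of 4 local states.
rng = random.Random(0)
n, num_local_states = 1000, 4
local_states = [rng.randrange(num_local_states) for _ in range(n)]
full_dist = empirical_distribution(local_states, num_local_states)

def mean_tv(k, trials=200):
    """Average TV distance between a k-agent subsample and the full population."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(local_states, k)  # sampling without replacement
        total += total_variation(
            empirical_distribution(sample, num_local_states), full_dist
        )
    return total / trials

for k in (10, 100, 1000):
    print(k, round(mean_tv(k), 4))
```

The average distance shrinks as $k$ grows and is exactly zero at $k = n$, where the subsample is the whole population. The sketch says nothing about the constants or logarithmic factors in the paper's bound.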

Abstract

Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.
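A rough count shows why working with $k$ subsampled agents can be so much cheaper than working with the joint state. The sketch below assumes a tabular setting in which local agents are exchangeable, so that only the empirical distribution of their states matters; it counts local-agent state configurations only and ignores the global agent and the action spaces. The function names are illustrative, and the paper's exact complexity is not stated in this excerpt.

```python
from math import comb

def joint_local_states(n, num_local_states):
    """Number of joint state configurations of n distinguishable local agents."""
    return num_local_states ** n

def empirical_distributions(k, num_local_states):
    """Number of distinct empirical distributions of k exchangeable agents
    over num_local_states states (a stars-and-bars count)."""
    return comb(k + num_local_states - 1, num_local_states - 1)

# Hypothetical sizes: 3 local states, n = 50 agents, k = 10 subsampled agents.
print(joint_local_states(50, 3))
print(empirical_distributions(10, 3))
```

For a fixed number of local states, the second count is a polynomial in $k$, while the first grows exponentially in $n$.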

Paper Structure

This paper contains 30 sections, 57 theorems, 183 equations, 5 figures, 1 table, 6 algorithms.

Key Result

Theorem 4.2

Let $\pi_{k,m}^\mathrm{est}$ denote the learned policy deployed in SUBSAMPLE-MFQ: Execution. Then, for all $s_0\in \mathcal{S}\coloneqq \mathcal{S}_g\times\mathcal{S}_l^n$, we have

Figures (5)

  • Figure 1: Bounded exploration in warehouse accidents and traffic congestion with Gaussian squeeze.
  • Figure 2: a) Reward optimality gap (log scale) with ${\pi}_{k,m}^\mathrm{est}$ running $300$ iterations. b) Computation time (in minutes) against sampling parameter $k$, for $k\leq n=8$, to learn policy $\hat{\pi}_{k,m}^\mathrm{est}$. c) Discounted cumulative rewards for $k\leq n=50$.
  • Figure 3: Star graph $S_n$
  • Figure 4: Flow of the algorithm and relevant analyses in learning $Q^*$. Here, (1) follows by performing SUBSAMPLE-MFQ: Learning on $\hat{Q}_{k,m}^0$. (2) follows from the paper's assumption bounding the estimation error of the learned $Q$-function. (3) follows from the Lipschitz continuity and total variation distance bounds. Finally, (4) follows from noting that $\hat{Q}_n^* = Q^*$.
  • Figure 5: Causal graph to demonstrate the dependencies between variables.

Theorems & Definitions (117)

  • Definition 2.3: $\epsilon$-optimal policy
  • Remark 3.1
  • Definition 3.2: Empirical Distribution Function
  • Remark 3.3
  • Definition 4.1
  • Theorem 4.2
  • Lemma 4.3: Controlling the Bellman Noise.
  • Theorem 4.4
  • Remark 4.5
  • Remark 4.6
  • ...and 107 more