Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Emile Anand, Ishani Karmarkar, Guannan Qu
TL;DR
This work tackles the curse of dimensionality in MARL with SUBSAMPLE-MFQ, which learns a policy for a cooperative system of one global agent and $n$ local agents by subsampling $k$ of the local agents. The method runs mean-field Q-learning on the $k$-agent surrogate system and executes a randomized policy that samples subsets at runtime. With high probability, the optimality gap of the resulting policy decays at a rate of $\tilde{O}(1/\sqrt{k})$, independent of $n$. The key analytical ingredients are a Lipschitz bound in total variation, a concentration bound for the empirical distribution under sampling without replacement, and an adapted performance-difference lemma; together they support a learning procedure that is polynomial in $k$. The approach yields an exponential speedup in $n$ when $k=O(\log n)$ and extends to off-policy and linear-MDP-like non-tabular settings, which suggests practical scalability gains for cooperative MARL and potential CTDE-type deployments. Overall, the paper contributes a subsampling framework with theoretical guarantees that preserves near-optimality while reducing sample and computational complexity in large-scale multi-agent systems.
Abstract
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.
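The subsampling idea can be illustrated with a small numerical sketch. This is not the authors' algorithm: it only draws $k$ of $n$ local agents without replacement and measures how far the empirical state distribution of the subset is from that of the full population in total variation, which is the quantity the concentration argument controls. The function names, the four-state local state space, and the population size are assumptions made for illustration.

```python
import random
from collections import Counter


def empirical_distribution(states, num_states):
    """Fraction of agents in each local state 0..num_states-1."""
    counts = Counter(states)
    return [counts.get(s, 0) / len(states) for s in range(num_states)]


def tv_distance(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_tv_gap(local_states, num_states, k, trials, rng):
    """Average TV distance between the full-population distribution and the
    distribution of k agents sampled without replacement."""
    full = empirical_distribution(local_states, num_states)
    total = 0.0
    for _ in range(trials):
        subset = rng.sample(local_states, k)  # sampling without replacement
        total += tv_distance(full, empirical_distribution(subset, num_states))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical population: n = 1000 local agents, 4 possible local states.
    population = [rng.randrange(4) for _ in range(1000)]
    for k in (10, 40, 160, 640):
        print(k, mean_tv_gap(population, 4, k, trials=200, rng=rng))
```

In this toy setting the printed gap should shrink as $k$ grows, in line with the $\tilde{O}(1/\sqrt{k})$ dependence stated in the abstract; the sketch says nothing about the Q-learning step or the policy's optimality gap itself.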
