Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Na Li; Yuchen Jiao; Hangguan Shan; Shefeng Yan

TL;DR

This work tackles memory and sample efficiency in two-player zero-sum Markov games (TZMGs) by introducing ME-Nash-QL, a model-free self-play algorithm that outputs a Markov Nash policy while achieving near-optimal space and computational complexity. It combines a memory-efficient Q-learning backbone with a reference-advantage decomposition and an early-settlement mechanism, which yields a sample complexity of $\widetilde{O}(H^4SAB/\varepsilon^2)$ and a burn-in cost of $O(SABH^{10})$, and it computes policies in polynomial time via Coarse Correlated Equilibrium. The results extend to multi-player general-sum Markov games (MGs), with a corresponding but more demanding sample complexity. Overall, ME-Nash-QL advances the theoretical understanding of model-free multi-agent reinforcement learning (MARL) by delivering both memory and computational efficiency alongside Nash and Markov policy guarantees in tabular TZMGs.

Abstract

The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, heavy dependence of the sample complexity on the horizon length and the size of the state space, high computational complexity, non-Markov policies, non-Nash policies, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, \emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)}, for two-player zero-sum Markov games, which are a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, where $S$ is the number of states, $A$ and $B$ are the numbers of actions of the two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\min\{A, B\}\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity $O(T\,\mathrm{poly}(AB))$ while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
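To get a feel for how the stated bounds scale, the following minimal sketch evaluates only the leading terms of the space bound $O(SABH)$ and the sample bound $\widetilde{O}(H^4SAB/\varepsilon^2)$ for a given game size. The names `GameSize`, `space_order`, `sample_order`, and `in_long_horizon_regime` are hypothetical helpers, not part of the paper. Constants and logarithmic factors are dropped, and the long-horizon condition $\min\{A, B\}\ll H^2$ is approximated by a plain strict inequality, so the numbers indicate orders of growth rather than actual memory or sample requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameSize:
    """Size of a tabular two-player zero-sum Markov game (hypothetical helper)."""
    S: int  # number of states
    A: int  # number of actions of the first player
    B: int  # number of actions of the second player
    H: int  # horizon length


def space_order(g: GameSize) -> int:
    """Leading term S*A*B*H of the O(SABH) space bound; constants dropped."""
    return g.S * g.A * g.B * g.H


def sample_order(g: GameSize, eps: float) -> float:
    """Leading term H^4*S*A*B/eps^2 of the sample bound; constants and logs dropped."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return g.H ** 4 * g.S * g.A * g.B / eps ** 2


def in_long_horizon_regime(g: GameSize) -> bool:
    """Crude proxy for min{A, B} << H^2: checks only the strict inequality."""
    return min(g.A, g.B) < g.H ** 2


game = GameSize(S=10, A=4, B=5, H=20)
print("space order:", space_order(game))
print("sample order at eps=0.5:", sample_order(game, eps=0.5))
print("long-horizon regime:", in_long_horizon_regime(game))
```

Doubling $H$ in this sketch doubles the space term but multiplies the sample term by 16, which reflects the $H^4$ dependence in the stated sample complexity.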
