Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Na Li, Yuchen Jiao, Hangguan Shan, Shefeng Yan

TL;DR

This work tackles memory and sample efficiency in two-player zero-sum Markov games (TZMGs) by introducing ME-Nash-QL, a model-free self-play algorithm that outputs a Markov Nash policy while achieving near-optimal space and computational complexity. It combines a memory-efficient Q-learning backbone with a reference-advantage decomposition and an early-settlement mechanism, yielding a sample complexity of $\widetilde{O}(H^4SAB/\varepsilon^2)$ and a burn-in cost of $O(SABH^{10})$, with polynomial-time policy computation via Coarse Correlated Equilibrium. The results extend to multi-player general-sum Markov games, at the price of a more demanding sample complexity. Overall, ME-Nash-QL advances the theoretical understanding of model-free MARL by delivering memory and computational efficiency together with Nash and Markov policy guarantees in tabular TZMGs.

Abstract

The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, heavy dependence of the sample complexity on the horizon length and the size of the state space, high computational complexity, non-Markov policies, non-Nash policies, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), for two-player zero-sum Markov games, a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, where $S$ is the number of states, $A$ and $B$ are the numbers of actions of the two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when $\min\{A, B\}\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity, $O(T\,\mathrm{poly}(AB))$, while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL also achieves the best burn-in cost, $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms have a burn-in cost of at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
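
To get a feel for the orders of magnitude in the abstract, the following minimal sketch evaluates the leading-order terms for a given problem size. It is illustrative only: constants and logarithmic factors hidden by the $O$ and $\widetilde{O}$ notation are dropped, the $H^{10}$ burn-in exponent is taken from the TL;DR, and the function name `me_nash_ql_orders` is a hypothetical helper, not something from the paper.

```python
def me_nash_ql_orders(S, A, B, H, eps):
    """Leading-order terms only; constants and log factors are dropped."""
    return {
        # Space complexity O(SABH).
        "space": S * A * B * H,
        # Sample complexity ~O(H^4 SAB / eps^2).
        "samples": H**4 * S * A * B / eps**2,
        # Burn-in cost O(SAB H^10), exponent as quoted in the TL;DR.
        "burn_in": S * A * B * H**10,
        # Rough proxy for the long-horizon regime min{A, B} << H^2
        # (a strict inequality stands in for "much smaller than").
        "long_horizon": min(A, B) < H**2,
    }


orders = me_nash_ql_orders(S=10, A=4, B=4, H=5, eps=0.1)
for name, value in orders.items():
    print(f"{name}: {value}")
```

All three quantities scale linearly in $SAB$, so in this simplified view the horizon $H$ and the accuracy $\varepsilon$ are the only parameters that change their relative sizes.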

Paper Structure

This paper contains 86 sections, 16 theorems, 302 equations, 1 table, 5 algorithms.

Key Result

Theorem 1

Consider any $\delta\in(0,1)$, and suppose that $c_{\mathrm{b}}>0$ is chosen to be a sufficiently large universal constant. Then there exists some absolute constant $C_0>0$ such that, with probability at least $1-\delta$, the main algorithm (ME-Nash-QL) outputs an $\varepsilon$-approximate Nash policy once the number of samples $T$ is at least $C_0$ times a quantity of order $H^4SAB/\varepsilon^2$, up to logarithmic factors.

Theorems & Definitions (24)

  • Definition 1: $\varepsilon$-approximate Nash equilibrium
  • Definition 2: Regret
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • ...and 14 more
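
Definition 1 is only listed by title here, so the sketch below illustrates the standard notion in the simplest special case, assuming a single state and a single step (a zero-sum matrix game). A strategy pair is treated as $\varepsilon$-approximate Nash when its duality gap, the max-player's best-response payoff minus the payoff the max-player is guaranteed against a best-responding min-player, is at most $\varepsilon$. The function names `duality_gap` and `is_eps_nash` are hypothetical, and the paper's own definition is stated for Markov games with horizon $H$ rather than for this one-shot case.

```python
def duality_gap(M, mu, nu):
    """Duality gap of (mu, nu) in a zero-sum matrix game.

    M[a][b] is the payoff to the row (max) player; mu and nu are
    probability vectors over rows and columns respectively.
    """
    rows, cols = len(M), len(M[0])
    # Payoff of each pure row action against the mixed strategy nu.
    row_payoffs = [sum(M[a][b] * nu[b] for b in range(cols)) for a in range(rows)]
    # Payoff conceded by mu against each pure column action.
    col_payoffs = [sum(mu[a] * M[a][b] for a in range(rows)) for b in range(cols)]
    # Best response of the max player minus best response of the min player.
    return max(row_payoffs) - min(col_payoffs)


def is_eps_nash(M, mu, nu, eps):
    """True when neither player can gain more than eps in total by deviating."""
    return duality_gap(M, mu, nu) <= eps


# Matching pennies: the uniform pair is an exact equilibrium.
pennies = [[1, -1], [-1, 1]]
print(duality_gap(pennies, [0.5, 0.5], [0.5, 0.5]))
print(is_eps_nash(pennies, [0.7, 0.3], [0.5, 0.5], eps=0.1))
```

Because best responses in a matrix game are always attained at pure actions, taking the maximum and minimum over pure actions is enough to compute the gap.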