Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

Dan Qiao, Yu-Xiang Wang

TL;DR

This work studies multi-agent reinforcement learning (MARL) under adaptivity constraints and introduces a policy-elimination framework for two-player zero-sum Markov games. The algorithm builds an absorbing Markov game and uses staged Crude and Fine Exploration to maintain a shrinking version space of candidate Nash policies, achieving a near-optimal batch complexity of $O(H+\log\log K)$ and a regret of $\widetilde{O}(\sqrt{H^{3}S^{2}ABK})$. A batch complexity lower bound that matches the upper bound up to logarithmic factors is also proven, and the framework extends to bandit games and reward-free MARL with near-optimal batch complexity. Additional results include an $\epsilon$-approximate Nash policy in $\widetilde{O}\left(\frac{H^3S^2AB}{\epsilon^2}\right)$ episodes. The results are a first step toward understanding MARL with low adaptivity, motivated by settings where deploying a new policy is costly.
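As a consistency check between the regret bound and the episode count quoted above (a heuristic sketch only, not the paper's proof of its sample complexity theorem), assume that the average per-episode suboptimality is controlled by the regret divided by $K$. Requiring that average to be at most $\epsilon$ and ignoring logarithmic factors gives:

```latex
\frac{1}{K}\,\widetilde{O}\!\left(\sqrt{H^{3}S^{2}ABK}\right)
  = \widetilde{O}\!\left(\sqrt{\frac{H^{3}S^{2}AB}{K}}\right) \le \epsilon
\quad\Longleftrightarrow\quad
K \ge \widetilde{O}\!\left(\frac{H^{3}S^{2}AB}{\epsilon^{2}}\right).
```

The two rates therefore agree in their dependence on $H$, $S$, $A$, $B$ and $\epsilon$.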

Abstract

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination-based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A$ and $B$ are the numbers of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound of $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with a $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, this is the first line of results towards understanding MARL with low adaptivity.
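To get a feel for how close the upper and lower batch complexity bounds are, the scaling terms can be evaluated numerically. The following is a minimal sketch that drops all hidden constants and logarithmic factors; the function names are hypothetical and chosen here for illustration, and the outputs are only orders of growth, not actual batch counts.

```python
import math


def batch_upper(H: int, K: int) -> float:
    """Upper-bound scaling H + log log K, with constants dropped (assumption)."""
    return H + math.log(math.log(K))


def batch_lower(H: int, K: int, A: int) -> float:
    """Lower-bound scaling H / log_A(K) + log log K, with constants dropped (assumption)."""
    return H / math.log(K, A) + math.log(math.log(K))


def regret_scale(H: int, S: int, A: int, B: int, K: int) -> float:
    """Regret scaling sqrt(H^3 S^2 A B K), ignoring logarithmic factors."""
    return math.sqrt(H**3 * S**2 * A * B * K)


def gap_ratio(H: int, K: int, A: int) -> float:
    """Ratio of the upper to the lower scaling term."""
    return batch_upper(H, K) / batch_lower(H, K, A)
```

For any $K \ge A$ we have $\log_A K \ge 1$, so the lower term never exceeds the upper term, and `gap_ratio` is at most $\log_A K$, which is the logarithmic gap the abstract refers to.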

Paper Structure

This paper contains 22 sections, 37 theorems, 80 equations, 1 table, 6 algorithms.

Key Result

Theorem 4.1

With probability $1-\delta$, Algorithm \ref{alg:main} will have regret bounded by $\widetilde{O}(\sqrt{H^{2}S^{2}ABT})$, where $T:=KH$ is the number of steps. Furthermore, the batch complexity of Algorithm \ref{alg:main} is bounded by $O(H+\log\log K)$.
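The step-based bound in this theorem and the episode-based bound quoted in the abstract are the same quantity. Substituting $T = KH$ makes this explicit:

```latex
\widetilde{O}\!\left(\sqrt{H^{2}S^{2}AB\,T}\right)
  = \widetilde{O}\!\left(\sqrt{H^{2}S^{2}AB \cdot KH}\right)
  = \widetilde{O}\!\left(\sqrt{H^{3}S^{2}ABK}\right).
```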

Theorems & Definitions (61)

  • Definition 2.1
  • Remark 2.2
  • Definition 3.1: The absorbing MG $\widetilde{P}$
  • Remark 3.2
  • Theorem 4.1: Regret and batch complexity of Algorithm \ref{alg:main}
  • Theorem 4.2: Lower bound
  • Theorem 4.3: Sample complexity
  • Theorem 5.1
  • Theorem 5.2
  • Lemma 6.1
  • ...and 51 more