Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

Dan Qiao; Yu-Xiang Wang

TL;DR

This work addresses multi-agent reinforcement learning under adaptivity constraints by introducing a policy-elimination framework for two-player zero-sum Markov games. The algorithm builds an absorbing Markov game and uses staged Crude and Fine Exploration to maintain a shrinking version space of candidate Nash policies, achieving a near-optimal batch complexity of $O\left(H+\log\log K\right)$ and a regret of $\widetilde{O}\left(\sqrt{H^{3}S^{2}ABK}\right)$. A lower bound on batch complexity that matches up to logarithmic factors is proven, and the framework extends to bandit games and reward-free MARL with near-optimal batch complexity. Additional results include an $\epsilon$-approximate Nash policy in $\widetilde{O}\left(\frac{H^3S^2AB}{\epsilon^2}\right)$ episodes. The findings advance the understanding of MARL under low adaptivity and are relevant to settings where policy updates are costly.
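As a rough consistency check between the regret bound and the episode count for an $\epsilon$-approximate Nash policy, one can ask when the average per-episode regret falls below $\epsilon$. This is a back-of-the-envelope sketch that ignores logarithmic factors and assumes the average regret controls the suboptimality of the output policy; it is not the argument used in the paper.

```latex
\frac{1}{K}\sqrt{H^{3}S^{2}ABK}
  = \sqrt{\frac{H^{3}S^{2}AB}{K}} \le \epsilon
\quad\Longleftrightarrow\quad
K \ge \frac{H^{3}S^{2}AB}{\epsilon^{2}}.
```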

Abstract

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, this is the first line of results towards understanding MARL with low adaptivity.
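To give some intuition for the $\log\log K$ term, the sketch below builds a geometric batch grid with endpoints $t_i = \lceil K^{1-2^{-i}} \rceil$, a schedule of the kind commonly used in batched bandit analyses. The function name `batch_endpoints` and the exact grid are assumptions for illustration; this section does not state the schedule the paper's algorithm actually uses.

```python
import math


def batch_endpoints(K: int) -> list[int]:
    """Return increasing episode indices at which a batch ends.

    Hypothetical illustration: endpoints follow t_i = ceil(K^(1 - 2^-i)),
    and the last batch is extended to cover all K episodes.
    """
    # Number of grid stages grows like log2(log2(K)).
    stages = max(1, math.ceil(math.log2(math.log2(K)))) if K > 2 else 1
    ends: list[int] = []
    for i in range(1, stages + 1):
        t = min(K, math.ceil(K ** (1.0 - 2.0 ** (-i))))
        # Skip endpoints that would not advance past the previous batch.
        if not ends or t > ends[-1]:
            ends.append(t)
    # Make sure the final batch ends exactly at episode K.
    if ends[-1] < K:
        ends.append(K)
    return ends


if __name__ == "__main__":
    for K in (10**3, 10**6, 10**9):
        ends = batch_endpoints(K)
        print(K, len(ends), ends)
```

Under this grid, the number of batches grows by only about one each time $K$ is squared, which is why a policy that is updated only at these endpoints uses $O(\log\log K)$ updates.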

Paper Structure (22 sections, 37 theorems, 80 equations, 1 table, 6 algorithms)

Introduction
Problem Setup
Main algorithms
Main results
Some discussions
Application to bandit games ($H=S=1$)
Extension to the reward-free case
Proof overview
Conclusion
Extended related work
Missing algorithm: EstimateTransition (Algorithm \ref{algo_transition_kernel}) and some explanation
Transition between original MG and absorbing MG
Proof of lemmas regarding Crude Exploration (Algorithm \ref{alg:crude})
Proof of lemmas regarding Fine Exploration (Algorithm \ref{alg:fine})
Proof of main theorems
...and 7 more sections

Key Result

Theorem 4.1

With probability $1-\delta$, the regret of Algorithm \ref{alg:main} is bounded by $\widetilde{O}(\sqrt{H^{2}S^{2}ABT})$, where $T:=KH$ is the number of steps. Furthermore, the batch complexity of Algorithm \ref{alg:main} is bounded by $O(H+\log\log K)$.
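The step-based bound in this theorem and the episode-based bound quoted in the abstract are the same statement. Substituting $T = KH$ gives:

```latex
\sqrt{H^{2}S^{2}AB\,T}
  = \sqrt{H^{2}S^{2}AB \cdot KH}
  = \sqrt{H^{3}S^{2}ABK}.
```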

Theorems & Definitions (61)

Definition 2.1
Remark 2.2
Definition 3.1: The absorbing MG $\widetilde{P}$
Remark 3.2
Theorem 4.1: Regret and batch complexity of Algorithm \ref{alg:main}
Theorem 4.2: Lower bound
Theorem 4.3: Sample complexity
Theorem 5.1
Theorem 5.2
Lemma 6.1
...and 51 more
