Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games

Songtao Feng; Ming Yin; Yu-Xiang Wang; Jing Yang; Yingbin Liang

TL;DR

This paper studies sample-efficient learning in two-player zero-sum Markov games. It proposes a model-free, stage-based Q-learning algorithm that performs variance reduction through a novel min-gap based reference-advantage decomposition and updates policies with a coarse correlated equilibrium (CCE) oracle. The authors prove that, in the tabular episodic setting, the algorithm finds an $\epsilon$-approximate Nash equilibrium after $K = \widetilde{O}(H^3 S A B / \epsilon^2)$ episodes. This matches the horizon dependence of the best model-based methods, which is minimax-optimal, and is the first such result for a model-free approach (optimal up to the $AB$ term). The result shows that the improved dependence on the horizon $H$ can be achieved without explicit model estimation. The analysis also develops techniques specific to Markov games to handle the interplay between CCE updates and variance reduction, which may be useful in broader multi-agent learning settings.
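To get a feel for how the bound $K = \widetilde{O}(H^3 S A B / \epsilon^2)$ scales, the short sketch below evaluates only its leading-order term. The function name `episodes_order` and the example sizes are illustrative assumptions, not taken from the paper, and the constants and logarithmic factors hidden by $\widetilde{O}$ are ignored, so the numbers are an order-of-magnitude guide rather than a guarantee.

```python
import math


def episodes_order(H, S, A, B, eps):
    """Leading-order term H^3 * S * A * B / eps^2 of the episode bound.

    Assumption: constants and log factors hidden by the O-tilde are dropped.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.ceil(H**3 * S * A * B / eps**2)


# Hypothetical problem sizes, chosen only for illustration.
base = episodes_order(H=10, S=20, A=5, B=5, eps=0.5)
print(base)                                             # 2000000
print(episodes_order(20, 20, 5, 5, 0.5) // base)        # 8: doubling H costs 2^3
print(episodes_order(10, 20, 5, 5, 0.25) // base)       # 4: halving eps costs 2^2
```

The cubic growth in $H$ and the quadratic growth in $1/\epsilon$ are visible directly; the product $AB$ enters linearly.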

Abstract

The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash Equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms achieves such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, thereby demonstrating for the first time that model-free algorithms can enjoy the same optimality in the $H$ dependence as model-based algorithms. The main improvement in the dependence on $H$ comes from leveraging the popular variance reduction technique based on the reference-advantage decomposition, previously used only for single-agent RL. However, this technique relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via the coarse correlated equilibrium (CCE) oracle. Thus, to extend the technique to Markov games, our algorithm features a key novel design: the reference value functions are updated to the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history, which yields the desired improvement in sample efficiency.
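The min-gap idea described above can be illustrated with a small sketch. It is a simplified reading of the abstract, not the paper's exact update rule: the function name `update_min_gap_reference`, the per-state list layout, and the example values are all assumptions made for illustration. For each state, the sketch keeps the (optimistic, pessimistic) value pair with the smallest gap seen so far, so the stored reference never gets worse even when the value estimates themselves are not monotone.

```python
def update_min_gap_reference(v_up, v_low, ref_up, ref_low, min_gap):
    """Return updated reference pair and best gap per state.

    v_up / v_low     : current optimistic / pessimistic value estimates
    ref_up / ref_low : stored reference pair
    min_gap          : smallest gap v_up - v_low observed so far
    All arguments are lists indexed by state (hypothetical layout).
    """
    new_up, new_low, new_gap = list(ref_up), list(ref_low), list(min_gap)
    for s in range(len(v_up)):
        gap = v_up[s] - v_low[s]
        if gap < new_gap[s]:
            # A tighter pair was found: adopt it as the new reference.
            new_up[s], new_low[s], new_gap[s] = v_up[s], v_low[s], gap
    return new_up, new_low, new_gap


# Two states, horizon H = 3: start from the trivial pair (H, 0).
ref_up, ref_low, min_gap = [3.0, 3.0], [0.0, 0.0], [3.0, 3.0]
ref_up, ref_low, min_gap = update_min_gap_reference(
    [2.5, 2.0], [0.5, 1.5], ref_up, ref_low, min_gap)
# Later estimates are not monotone: the gap widens in state 0, narrows in state 1.
ref_up, ref_low, min_gap = update_min_gap_reference(
    [2.6, 1.75], [0.4, 1.5], ref_up, ref_low, min_gap)
print(ref_up, ref_low, min_gap)
```

In state 0 the reference stays at the earlier, tighter pair, while in state 1 it moves to the new pair; this is the behavior that replaces the monotonicity property used in the single-agent setting.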

Paper Structure (25 sections, 23 theorems, 127 equations, 3 algorithms)

Introduction
Related Work
Preliminaries
Algorithm Design
Theoretical Analysis
Main Result
Proof Outline
Conclusion
Details of Algorithm \ref{alg:1-simple}
Comparison to Existing Algorithms
Notations
Proof of Theorem \ref{thm:main}
Proof of Lemma \ref{app-prop:Q-chain} (Step I)
Proof of Lemma \ref{app-lemma:reference} (Step II)
Proof of Lemma \ref{app-lemma:Lambda} (Step IV)
...and 10 more sections

Key Result

Theorem 4.1

For any $\delta\in(0,1)$, let the agents run Algorithm \ref{alg:1} for $K$ episodes with $K\geq \widetilde{O}(H^3SAB/\epsilon^2)$. Then, with probability at least $1-\delta$, the output policy $(\mu^\mathrm{out},\nu^\mathrm{out})$ of Algorithm \ref{Alg:2} is an $\epsilon$-approximate Nash equilibrium.
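For readers unfamiliar with the term, the guarantee can be read through the usual duality-gap formulation of an $\epsilon$-approximate Nash equilibrium. The notation below is an assumption based on the standard convention for zero-sum Markov games, since this page does not reproduce Definition 2.1: $s_1$ is the initial state, $\mu$ is the max-player's policy, $\nu$ is the min-player's policy, and $\dagger$ denotes a best response.

```latex
% Best-response values at the initial state (assumed standard notation):
%   V_1^{\dagger,\nu}(s_1) = \max_{\mu'} V_1^{\mu',\nu}(s_1),
%   V_1^{\mu,\dagger}(s_1) = \min_{\nu'} V_1^{\mu,\nu'}(s_1).
% The output pair is an \epsilon-approximate NE if its duality gap is at most \epsilon:
\[
  V_1^{\dagger,\nu^{\mathrm{out}}}(s_1) - V_1^{\mu^{\mathrm{out}},\dagger}(s_1) \le \epsilon .
\]
```

Under this reading, neither player can gain more than $\epsilon$ in total by deviating unilaterally from the output policy.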

Theorems & Definitions (25)

Definition 2.1: $\epsilon$-optimal Nash equilibrium (NE)
Theorem 4.1
Lemma 4.2
Lemma 4.3
Lemma 4.4
Corollary 4.5
Lemma 4.6
Lemma D.1: Restatement of Lemma \ref{prop:Q-chain}
Lemma D.2: Restatement of Lemma \ref{lemma:ref-1}
Corollary D.3: Restatement of Corollary \ref{coro:ref-1}
...and 15 more
