Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
Yuling Yan, Gen Li, Yuxin Chen, Jianqing Fan
TL;DR
This work tackles offline learning of Nash equilibria (NE) in two-player zero-sum Markov games by proposing VI-LCB-Game, a pessimistic model-based algorithm that operates on an empirical Markov game constructed from offline data. The method employs Bernstein-style lower confidence penalties to produce conservative Q-value estimates and computes a Nash equilibrium state by state, yielding an overall policy pair with duality gap at most $\varepsilon$. The authors establish a sample complexity of $\widetilde{O}\left(\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^3\varepsilon^2}\right)$, valid for any $\varepsilon\in(0,1/(1-\gamma)]$, and prove a lower bound that matches it up to logarithmic factors, so the bound is minimax-optimal. The bound scales with the sum $A+B$ rather than with the product $AB$, and the algorithm is simple: it needs neither variance reduction nor sample splitting. Data coverage is measured by the clipped unilateral concentrability coefficient $C_{\mathsf{clipped}}^{\star}$, a unilateral notion that accounts for distribution shift in the offline data.
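To make the duality-gap criterion concrete, the following is a minimal sketch for the simplest case, a single-state (matrix) game, rather than the discounted Markov game studied in the paper. The function name `duality_gap`, the payoff matrix, and the policies are all illustrative assumptions and do not come from the paper.

```python
def duality_gap(Q, mu, nu):
    """Duality gap of a policy pair (mu, nu) in a zero-sum matrix game.

    Q[a][b] is the payoff to the max-player when the max-player plays a
    and the min-player plays b. mu and nu are probability vectors.
    This is a single-state illustration (an assumption of this sketch),
    not the Markov-game definition used in the paper.
    """
    num_a, num_b = len(Q), len(Q[0])
    # Value the max-player gets by best-responding to nu.
    best_max = max(
        sum(Q[a][b] * nu[b] for b in range(num_b)) for a in range(num_a)
    )
    # Value the min-player forces by best-responding to mu.
    best_min = min(
        sum(mu[a] * Q[a][b] for a in range(num_a)) for b in range(num_b)
    )
    return best_max - best_min


# Matching pennies: the uniform policy pair is the Nash equilibrium.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
gap_at_ne = duality_gap(matching_pennies, [0.5, 0.5], [0.5, 0.5])   # 0.0
gap_off_ne = duality_gap(matching_pennies, [1.0, 0.0], [0.5, 0.5])  # 1.0
```

A policy pair is an $\varepsilon$-approximate NE in this sense when the returned gap is at most $\varepsilon$; the gap is zero exactly at an equilibrium.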
Abstract
This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called VI-LCB-Game, that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to a logarithmic factor). Here, $C_{\mathsf{clipped}}^{\star}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result is its algorithmic simplicity, which shows that variance reduction and sample splitting are unnecessary for achieving sample optimality.
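As a rough aid for reading the bound, the sketch below evaluates the leading term $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ for given problem sizes. It drops the constant and logarithmic factors that the abstract leaves unspecified, so the result indicates scaling only and is not an actual sample-size requirement. The function name `sample_complexity_leading_term` and the numeric inputs are hypothetical.

```python
def sample_complexity_leading_term(c_clipped, S, A, B, gamma, eps):
    """Leading term C * S * (A + B) / ((1 - gamma)^3 * eps^2).

    Constants and log factors are omitted (an assumption of this sketch),
    so the value shows how the bound scales, not a usable sample size.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    # The abstract states the bound for eps in (0, 1 / (1 - gamma)].
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1 / (1 - gamma)]")
    return c_clipped * S * (A + B) / ((1.0 - gamma) ** 3 * eps ** 2)


# Hypothetical inputs chosen so the arithmetic is exact in floating point.
term = sample_complexity_leading_term(
    c_clipped=2.0, S=10, A=5, B=5, gamma=0.5, eps=0.5
)  # 200 / (0.125 * 0.25) = 6400.0
```

Doubling $A$ and $B$ together doubles this term, whereas a quantity proportional to the product $AB$ would quadruple.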
