Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

Yuling Yan; Gen Li; Yuxin Chen; Jianqing Fan

TL;DR

This work tackles offline learning of Nash equilibria in two-player zero-sum Markov games by proposing VI-LCB-Game, a pessimistic model-based algorithm that operates on an empirical Markov game constructed from offline data. The method employs Bernstein-style lower confidence penalties to produce conservative Q-value estimates and to compute state-wise Nash equilibria, yielding an overall NE with duality gap at most $\varepsilon$. The authors establish a minimax-optimal sample complexity of $\widetilde{O}\left(\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^3\varepsilon^2}\right)$, valid for any $\varepsilon\in(0,1/(1-\gamma)]$, and prove a matching lower bound up to logarithmic factors. A key advance is the linear-in-$A+B$ dependence (avoiding the curse of multiplying actions) and the algorithm’s simplicity, as it avoids variance reduction and sample splitting, while relying on a unilateral data-coverage notion via the clipped unilateral concentrability $C_{\mathsf{clipped}}^{\star}$ to handle distribution shifts in offline data.
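The two ingredients named above, an empirical model built from offline data and a Bernstein-style penalty that makes the Q-estimate conservative, can be illustrated with a minimal sketch. The function names (`empirical_model`, `bernstein_penalty`, `pessimistic_q`), the `log_term` parameter, and the exact form and constants of the penalty are assumptions made for illustration; they are not the paper's definitions. The state-wise Nash equilibrium computation is omitted.

```python
import math
from collections import defaultdict


def empirical_model(transitions, num_states):
    """Count-based estimate of P(s' | s, a, b) from offline tuples (s, a, b, s_next)."""
    counts = defaultdict(lambda: [0] * num_states)
    for s, a, b, s_next in transitions:
        counts[(s, a, b)][s_next] += 1
    model = {}
    for key, c in counts.items():
        n = sum(c)
        # Store the visit count alongside the empirical next-state distribution.
        model[key] = (n, [x / n for x in c])
    return model


def bernstein_penalty(n, probs, values, gamma, log_term):
    """Variance-aware penalty (illustrative form), clipped at 1 / (1 - gamma)."""
    v_max = 1.0 / (1.0 - gamma)
    if n == 0:
        # No data for this (s, a, b): fall back to the largest possible penalty.
        return v_max
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    raw = math.sqrt(log_term * var / n) + log_term * v_max / n
    return min(raw, v_max)


def pessimistic_q(reward, n, probs, values, gamma, log_term):
    """One conservative backup: r + gamma * P_hat V - penalty, floored at 0."""
    backup = reward + gamma * sum(p * v for p, v in zip(probs, values))
    penalty = bernstein_penalty(n, probs, values, gamma, log_term)
    return max(backup - penalty, 0.0)
```

The penalty shrinks as the visit count of a state-action triple grows, so well-covered triples are penalized little while poorly covered ones are treated conservatively.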

Abstract

This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called VI-LCB-Game, that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result is its algorithmic simplicity: it shows that variance reduction and sample splitting are unnecessary for achieving sample optimality.
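To get a feel for how the stated bound scales, the following sketch evaluates its leading term for given problem parameters. The function name `sample_size_order` is hypothetical, and the unspecified constant and logarithmic factor are deliberately dropped, so the output indicates an order of magnitude only, not a required sample size.

```python
def sample_size_order(c_clipped, num_states, num_a, num_b, gamma, eps):
    """Leading term C * S * (A + B) / ((1 - gamma)^3 * eps^2), without constants or logs."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1/(1-gamma)]")
    return c_clipped * num_states * (num_a + num_b) / ((1.0 - gamma) ** 3 * eps ** 2)


# Example: halving the target accuracy quadruples the leading term.
coarse = sample_size_order(1.0, 10, 5, 5, 0.5, 1.0)
fine = sample_size_order(1.0, 10, 5, 5, 0.5, 0.5)
```

The dependence on the action counts is through the sum `num_a + num_b` rather than their product, which is the linear-in-$A+B$ scaling highlighted in the abstract.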

Paper Structure (63 sections, 11 theorems, 227 equations, 1 algorithm)

Introduction
Data coverage for offline Markov games.
An overview of main results.
Notation.
Problem formulation
Preliminaries
Zero-sum two-player Markov games.
Policy, value function, Q-function, and occupancy distribution.
Nash equilibrium.
Offline dataset (batch dataset)
Algorithm and main theory
Algorithm design
The empirical Markov game.
Pessimistic Bellman operators.
Pessimistic value iteration with Bernstein-style penalty.
...and 48 more sections

Key Result

Theorem 1

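Consider any initial state distribution $\rho\in\Delta(\mathcal{S})$, and suppose that the paper's unilateral concentrability assumption holds. Assume that $1/2\leq\gamma<1$, and consider any $\delta\in(0,1)$ and $\varepsilon\in(0,\frac{1}{1-\gamma}]$. Then with probability exceeding $1-\delta$, the policy pair returned by VI-LCB-Game is an $\varepsilon$-approximate Nash equilibrium, as long as the sample size exceeds $c_{1}\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to a logarithmic factor) for some sufficiently large constant $c_{1}>0$.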
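For reference, the accuracy notion behind an $\varepsilon$-approximate Nash equilibrium can be written as a duality gap. The notation below is assumed for illustration: $V^{\mu,\nu}(\rho)$ denotes the value of the policy pair $(\mu,\nu)$ under the initial distribution $\rho$, with the max-player choosing $\mu$ and the min-player choosing $\nu$.

$$
\mathsf{gap}(\mu,\nu) \;=\; \max_{\mu'} V^{\mu',\nu}(\rho) \;-\; \min_{\nu'} V^{\mu,\nu'}(\rho) \;\geq\; 0,
\qquad
(\mu,\nu)\ \text{is an } \varepsilon\text{-approximate NE} \iff \mathsf{gap}(\mu,\nu)\leq\varepsilon .
$$

The gap is nonnegative because $\max_{\mu'} V^{\mu',\nu}(\rho)\geq V^{\mu,\nu}(\rho)\geq \min_{\nu'} V^{\mu,\nu'}(\rho)$, and it vanishes exactly when neither player can gain by deviating unilaterally.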

Theorems & Definitions (27)

Theorem 1
Remark 1
Remark 2
Theorem 2
Remark 3
Theorem 3
Lemma 1
proof
Lemma 2
proof
...and 17 more
