Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions

Reda Ouhamma, Maryam Kamgarpour

TL;DR

This work studies payoff-based decentralized learning in infinite-horizon discounted two-player zero-sum Markov games. It relaxes two central assumptions of prior work, strong reachability and uniform mixing times, by introducing Tsallis entropy regularization in an algorithm called Tsallis-smoothed best-response with value iteration (TBRVI). Under the weaker assumption that there exists a single irreducible strategy with finite mixing time, the authors prove finite-time convergence to an ε-approximate Nash equilibrium and establish a sample complexity that is polynomial in 1/ε, with rates depending on the diameter-like parameter d_r and the Tsallis regularization parameter η. Key technical contributions include lower bounds and Lipschitz properties for Tsallis smoothing, drift inequalities for both policy and value updates, and a Nash-gap bound that decomposes into bias, drift, and error terms. The results indicate that Tsallis entropy improves exploration and mixing compared to Shannon-entropy-based smoothing, enabling provable convergence without the stringent assumptions of prior work and clarifying the trade-offs between exploration, mixing, and convergence in multi-agent reinforcement learning. The findings advance the theory of self-play in zero-sum Markov games; open directions include improving the sample complexity, extending to continuous spaces, and validating in real-world settings.

Abstract

We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions and without sharing information. Prior work established finite-time convergence to an approximate Nash equilibrium under strong reachability and mixing time assumptions. We propose a convergent algorithm that significantly relaxes these assumptions, requiring only the existence of a single policy (not necessarily known) with bounded reachability and mixing time. Our key technical novelty is introducing Tsallis entropy regularization to smooth the best-response policy updates. By suitably tuning this regularization, we ensure sufficient exploration, thus bypassing previous stringent assumptions on the MDP. By establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer, we prove finite-time convergence to an approximate Nash equilibrium.
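To make the smoothing idea concrete, here is a minimal sketch of a Tsallis-smoothed best response next to the usual Shannon (softmax) one. It is an illustration under assumptions, not the paper's algorithm: this section does not state the exact Tsallis parameterization, so the sketch assumes the index-1/2 Tsallis entropy 2(Σ_i √p_i − 1) scaled by η, and the function names and payoff vector are hypothetical. Under that assumption the maximizer has the form p_i = (η / (λ − q_i))², with the multiplier λ found by bisection, and every action keeps probability at least (η / (q_max − q_min + η√n))², which decays polynomially in η rather than exponentially as with softmax.

```python
import math


def tsallis_smoothed_best_response(q, eta, iters=100):
    """Maximize <p, q> + eta * 2 * (sum_i sqrt(p_i) - 1) over the simplex.

    Assumption: index-1/2 Tsallis entropy (not confirmed by the source).
    Stationarity gives p_i = (eta / (lam - q_i))**2 for a multiplier
    lam > max(q); lam is found by bisection so that the p_i sum to 1.
    """
    n = len(q)
    q_max = max(q)
    lo = q_max + eta                 # at this lam the p_i sum to >= 1
    hi = q_max + eta * math.sqrt(n)  # at this lam the p_i sum to <= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        total = sum((eta / (lam - qi)) ** 2 for qi in q)
        if total > 1.0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    p = [(eta / (lam - qi)) ** 2 for qi in q]
    s = sum(p)
    return [pi / s for pi in p]      # renormalize away bisection error


def softmax_best_response(q, eta):
    """Shannon-entropy smoothing: p_i proportional to exp(q_i / eta)."""
    q_max = max(q)
    w = [math.exp((qi - q_max) / eta) for qi in q]
    s = sum(w)
    return [wi / s for wi in w]


q_values = [1.0, 0.0, -1.0]  # hypothetical local payoff estimates
eta = 0.1                    # hypothetical regularization strength
p_tsallis = tsallis_smoothed_best_response(q_values, eta)
p_softmax = softmax_best_response(q_values, eta)
print("min prob, Tsallis:", min(p_tsallis))
print("min prob, softmax:", min(p_softmax))
```

With a small η, the least-favored action retains far more probability under the Tsallis-smoothed response than under softmax, which is the exploration effect the abstract refers to.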

Paper Structure (42 sections, 22 theorems, 143 equations, 1 figure, 1 algorithm)

Key Result

Theorem 1

Assume that the players follow Algorithm TBRVI with the parameters $\alpha_k = \alpha/(k+h)$ and $\beta_k= \beta/(k+h)$, where $\alpha,h>0$ and $\frac{\alpha}{h} < 1$. Choose $\frac{\beta}{\alpha} \leq \frac{c_\eta\ell_{\eta}^3 (1-\gamma)^2}{6272 \eta^3 |\mathcal{S}|A_{\max}^4}$, where [...] where $k_0=\min \left\{k \geq 0 \mid k \geq \tau_k\right\}$, $\tau_k = t_{\ell_\eta,\beta_k}$, $L_\ldots$ [...]
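The step-size conditions in the theorem can be sketched in a few lines. This is only an illustration of the stated schedules and the stated cap on β/α; the function names and all numeric values are hypothetical, and the constants c_η and ℓ_η are treated as given inputs because their definitions fall in the truncated part of the statement.

```python
def step_sizes(k, alpha, beta, h):
    """Return (alpha_k, beta_k) = (alpha / (k + h), beta / (k + h))."""
    if not (alpha > 0 and h > 0 and alpha / h < 1):
        raise ValueError("the theorem requires alpha, h > 0 and alpha / h < 1")
    return alpha / (k + h), beta / (k + h)


def max_beta_over_alpha(c_eta, ell_eta, gamma, eta, n_states, a_max):
    """Cap on beta / alpha as written in the theorem statement.

    c_eta and ell_eta are placeholders here: their definitions are not
    available in this excerpt.
    """
    numerator = c_eta * ell_eta ** 3 * (1 - gamma) ** 2
    denominator = 6272 * eta ** 3 * n_states * a_max ** 4
    return numerator / denominator


# Hypothetical values, chosen only to exercise the formulas.
alpha, h = 1.0, 2.0
cap = max_beta_over_alpha(c_eta=1.0, ell_eta=0.1, gamma=0.9,
                          eta=0.5, n_states=3, a_max=2)
beta = cap * alpha  # largest beta allowed by the cap for this alpha
alpha_0, beta_0 = step_sizes(0, alpha, beta, h)
alpha_9, beta_9 = step_sizes(9, alpha, beta, h)
print(cap, alpha_0, beta_0)
```

Because both sequences share the denominator k + h, the ratio β_k / α_k equals β / α at every iteration, so the cap on β / α fixes how much smaller one step size is than the other throughout the run.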

Figures (1)

  • Figure 1: Two MDPs with three states each; the transitions from states $1$ and $2$ are action-independent. The arrows indicate the possible transitions, labeled with their probabilities. Left: a single-agent MDP with two actions $a$ and $b$ in state $0$. Right: a two-player MDP where the second player has two actions $c$ and $d$ in state $0$.

Theorems & Definitions (38)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 1: Nash Gap bound
  • Corollary 1: Sample Complexity
  • Corollary 2: Rationality
  • Lemma 1: Policy lower bound
  • Lemma 2: Stationary distribution lower bound
  • ...and 28 more