Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions
Reda Ouhamma, Maryam Kamgarpour
TL;DR
This work studies payoff-based decentralized learning in infinite-horizon discounted two-player zero-sum Markov games. It shows how to relax two central assumptions, strong reachability and uniform mixing times, by adding Tsallis entropy regularization to the policy update, yielding an algorithm that combines a Tsallis-smoothed best response with value iteration (TBRVI). Under the weaker assumption that there exists a single irreducible strategy with finite mixing time, the authors prove finite-time convergence to an ε-approximate Nash equilibrium and establish a sample complexity polynomial in 1/ε, with rates depending on the diameter-like parameter d_r and the Tsallis regularization parameter η. Key technical contributions include lower bounds and Lipschitz properties for Tsallis smoothing, drift inequalities for both the policy and value updates, and a Nash-gap bound that decomposes into bias, drift, and error terms. The results show that Tsallis entropy improves exploration and mixing compared to Shannon-entropy-based smoothing, which enables provable convergence without the stringent assumptions of prior work and offers insight into the trade-offs between exploration, mixing, and convergence in multi-agent reinforcement learning. The findings advance the theory of self-play in zero-sum Markov games and open avenues for improving sample complexity, extending to continuous spaces, and validating in real-world settings.
Abstract
We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions and without sharing information. Prior works established finite-time convergence to an approximate Nash equilibrium under strong reachability and mixing time assumptions. We propose a convergent algorithm that significantly relaxes these assumptions, requiring only the existence of a single policy (not necessarily known) with bounded reachability and mixing time. Our key technical novelty is introducing Tsallis entropy regularization to smooth the best-response policy updates. By suitably tuning this regularization, we ensure sufficient exploration, thus bypassing previous stringent assumptions on the MDP. By establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer, we prove finite-time convergence to an approximate Nash equilibrium.
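
To make the idea of a Tsallis-smoothed best response concrete, the following is a minimal sketch, not the authors' TBRVI update. It assumes Tsallis entropy of order 1/2, S(p) = 2(Σ_i √p_i − 1), and a hypothetical temperature parameter `tau`; the order and parametrization (η) used in the paper are not stated in this summary. Under these assumptions the maximizer of ⟨p, q⟩ + tau·S(p) over the simplex has the form p_i = tau² / (λ − q_i)², where the normalizer λ is found by bisection.

```python
import numpy as np


def tsallis_smoothed_best_response(q_values, tau=1.0, tol=1e-12, max_iter=200):
    """Illustrative smoothed best response (assumed form, not the paper's).

    Maximizes <p, q> + tau * S(p) over the probability simplex, where
    S(p) = 2 * (sum_i sqrt(p_i) - 1) is the Tsallis entropy of order 1/2.
    Stationarity gives p_i = tau^2 / (lam - q_i)^2 for a normalizer lam.
    """
    q = np.asarray(q_values, dtype=float)
    n = q.size
    # At lam = max(q) + tau the best action alone has mass 1 (sum >= 1);
    # at lam = max(q) + tau * sqrt(n) every action has mass <= 1/n (sum <= 1).
    lo = q.max() + tau
    hi = q.max() + tau * np.sqrt(n)
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        total = np.sum((tau / (lam - q)) ** 2)
        if total > 1.0:
            lo = lam  # too much mass: increase the normalizer
        else:
            hi = lam  # too little mass: decrease the normalizer
        if hi - lo < tol:
            break
    p = (tau / (hi - q)) ** 2
    return p / p.sum()  # remove the residual bisection error


def softmax_best_response(q_values, tau=1.0):
    """Shannon-entropy (softmax) smoothing, shown for comparison."""
    z = np.asarray(q_values, dtype=float) / tau
    w = np.exp(z - z.max())
    return w / w.sum()
```

In this sketch the probability of a poor action decays polynomially in its payoff gap, whereas softmax decays exponentially. Calling both functions on `[0.0, -10.0]`, for example, gives the second action noticeably more mass under Tsallis smoothing, which is one way to read the exploration claim above.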
