A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs

Junyue Zhang; Yifen Mu

TL;DR

This work studies stochastic games with long-run average payoffs and develops a payoff-based policy-gradient framework built on bounded advantage functions and gradient dominance. It proves that the individual payoff gradients are Lipschitz continuous and that the value function satisfies a gradient-dominance property. These results enable a distributed gradient-estimation scheme based on Simultaneous Perturbation Stochastic Approximation (SPSA), run within a Regularized Robbins-Monro (mirror-descent) architecture with entropic regularization. The proposed algorithm is distributed, relies only on observed payoffs, and uses explicit parametric schedules. It converges to a Nash equilibrium (NE) with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable NE exists. The paper also discusses limitations, such as the purely asymptotic nature of the guarantees and their applicability to broader game classes, and potential extensions to non-asymptotic convergence and zero-sum settings. Overall, it provides a principled approach to learning in stochastic games with long-run average payoffs, with convergence guarantees for a class of games that includes monotone games.
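To make the combination of SPSA and entropic mirror descent concrete, here is a minimal Python sketch of the general technique for a single player on one probability simplex. It is not the paper's algorithm: the stateless setting, the `payoff` oracle, all function names, and the schedules `gamma = 1/t` and `delta ∝ t^(-1/4)` are assumptions made for illustration, and the paper's own schedules and state-dependent policies may differ.

```python
import numpy as np


def softmax(y):
    """Entropic mirror map: turns a score vector into a point of the simplex."""
    e = np.exp(y - np.max(y))
    return e / e.sum()


def inner_radius(n):
    """Radius of the largest ball around the uniform point that fits in the simplex."""
    return 1.0 / np.sqrt(n * (n - 1))


def sample_tangent_direction(n, rng):
    """Unit vector with zero-sum entries, i.e. a direction along the simplex."""
    z = rng.standard_normal(n)
    z -= z.mean()
    return z / np.linalg.norm(z)


def perturbed_policy(x, z, delta):
    """Shrink x toward the uniform point, then step delta along z (stays feasible)."""
    n = x.size
    center = np.full(n, 1.0 / n)
    x_shrunk = x + (delta / inner_radius(n)) * (center - x)
    return x_shrunk + delta * z


def spsa_gradient(payoff_value, z, delta):
    """One-point SPSA estimate built from a single observed payoff."""
    d = z.size - 1  # dimension of the simplex's tangent space
    return (d / delta) * payoff_value * z


def learn(payoff, n_actions, steps, seed=0):
    """Score aggregation with SPSA feedback; `payoff` is a hypothetical oracle."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n_actions)
    r = inner_radius(n_actions)
    for t in range(1, steps + 1):
        gamma = 1.0 / t              # placeholder step-size schedule
        delta = 0.5 * r / t ** 0.25  # placeholder perturbation schedule
        x = softmax(y)
        z = sample_tangent_direction(n_actions, rng)
        u = payoff(perturbed_policy(x, z, delta))  # only a payoff is observed
        y = y + gamma * spsa_gradient(u, z, delta)
    return softmax(y)
```

A call such as `learn(lambda x: float(x @ np.array([1.0, 0.0, 0.0])), 3, 200)` returns a probability vector over three actions. In the setting of the paper, the observed payoff would instead come from play in the stochastic game, and each player would run such an update using only its own payoffs.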

Abstract

Despite their significant potential for applications, stochastic games with long-run average payoffs have received limited scholarly attention, particularly regarding the development of learning algorithms, owing to the difficulty of their mathematical analysis. In this paper, we study stochastic games with long-run average payoffs and present an equivalent formulation of the individual payoff gradients by defining advantage functions, which we prove to be bounded. This result allows us to show that the individual payoff gradient function is Lipschitz continuous with respect to the policy profile and that the value function of the game exhibits the gradient dominance property. Leveraging these insights, we devise a payoff-based gradient estimation approach and integrate it with the Regularized Robbins-Monro method from stochastic approximation theory to construct a bandit learning algorithm suited to stochastic games with long-run average payoffs. Additionally, we prove that if all players adopt our algorithm, the policy profile will asymptotically converge to a Nash equilibrium with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable Nash equilibrium exists. This condition covers a wide class of games, including monotone games.
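For orientation, the objects named in the abstract can be written out in standard notation. The symbols below ($\rho_i$, $r_i$, $v$) are assumptions taken from the usual average-reward and variational-stability literature, not necessarily the paper's own notation. The long-run average payoff of player $i$ under a stationary policy profile $\pi$, assuming the limit exists, is

$$\rho_i(\pi) = \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1} r_i(s_t, a_t)\right].$$

A profile $\pi^*$ is a Nash equilibrium if no player gains from a unilateral deviation:

$$\rho_i(\pi_i^*, \pi_{-i}^*) \ge \rho_i(\pi_i, \pi_{-i}^*) \quad \text{for all players } i \text{ and all policies } \pi_i.$$

With $v(\pi) = \big(\nabla_{\pi_i} \rho_i(\pi)\big)_i$ denoting the stacked individual payoff gradients, the usual form of the stability conditions is

$$\langle v(\pi), \pi - \pi^* \rangle \le 0 \ \text{ for all } \pi \quad \text{(globally neutrally stable)}, \qquad \langle v(\pi), \pi - \pi^* \rangle < 0 \ \text{ for all } \pi \ne \pi^* \quad \text{(globally variationally stable)}.$$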

Paper Structure (12 sections, 18 theorems, 111 equations, 1 algorithm)

Introduction
Problem Setup and Preliminaries
Properties of Individual Payoff Functions
The Learning Framework
Regularized Robbins-Monro process
Simultaneous perturbation stochastic approximation
A distributed learning algorithm in stochastic games
Convergence Analysis and Results
Discussion
Proofs for Section "Properties of Individual Payoff Functions"
Proofs for Section "The Learning Framework"
Proofs for Section "Convergence Analysis and Results"

Key Result

Lemma 3.1

The advantage functions $\mathrm{adv}^{\pi}_i(s,a)$ are bounded with respect to the policy profile $\pi$, and so are the average advantage functions $\overline{\mathrm{adv}}^{\pi}_{i}(s,a_i)$.

Theorems & Definitions (39)

Definition 2.1: Stationary policy
Definition 2.2: Nash equilibrium
Lemma 3.1
Theorem 3.2: Policy gradient theorem
Remark 3.1
Lemma 3.3
Theorem 3.4
Theorem 3.5: Gradient dominance property
Theorem 3.6: First-order stationary policies are Nash
Definition 4.1: Regularizer
...and 29 more
