A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary Opponent

Mehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar

TL;DR

This work tackles online learning in infinite-horizon two-player zero-sum stochastic games with an unknown transition model and an arbitrary time-adaptive opponent under the average-reward criterion. It introduces PSRL-ZSG, a posterior sampling-based algorithm that proceeds in episodes: at the start of each episode it samples a transition kernel from the posterior and computes a maximin policy for the sampled game via the Bellman equation, and it ends the episode according to doubling criteria. The main contribution is a Bayesian regret bound of $\widetilde{O}(HS\sqrt{AT})$, improving over the prior $\widetilde{O}(\sqrt[3]{DS^2AT^2})$ bound achieved by UCSG under the same assumptions and matching the lower bound's dependence on $T$ up to logarithmic factors. The approach handles time-adaptive, history-dependent opponents without strong ergodicity assumptions, offering a simpler, near-optimal alternative to optimism-based methods in stochastic games.
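The episode structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a symmetric Dirichlet prior over next-state distributions, assumes an episode ends once the visit count of some (state, joint action) pair has more than doubled since the episode began, and replaces the maximin policy computation with uniformly random joint actions. The names `sample_kernel`, `count_doubled`, and `run` are hypothetical.

```python
import random


def sample_kernel(counts, rng, prior=1.0):
    """Draw one transition kernel from a Dirichlet posterior.

    counts[s][a][s2] holds observed transition counts. `prior` is an assumed
    symmetric Dirichlet pseudo-count; the source does not specify the prior.
    """
    kernel = []
    for per_state in counts:
        rows = []
        for row in per_state:
            # Normalizing independent Gamma draws yields a Dirichlet sample.
            draws = [rng.gammavariate(c + prior, 1.0) for c in row]
            total = sum(draws)
            rows.append([d / total for d in draws])
        kernel.append(rows)
    return kernel


def count_doubled(visits_now, visits_start):
    """True once some (state, joint action) count exceeds twice its episode-start value."""
    return any(
        now > 2 * start
        for row_now, row_start in zip(visits_now, visits_start)
        for now, start in zip(row_now, row_start)
    )


def run(num_states=3, num_joint_actions=4, horizon=500, seed=0):
    rng = random.Random(seed)
    counts = [[[0] * num_states for _ in range(num_joint_actions)]
              for _ in range(num_states)]
    # Stand-in for the unknown environment: a kernel drawn from the prior.
    true_kernel = sample_kernel(counts, rng)
    visits = [[0] * num_joint_actions for _ in range(num_states)]
    state, episodes, t = 0, 0, 0
    sampled = None
    while t < horizon:
        episodes += 1
        visits_start = [row[:] for row in visits]
        # Posterior sample for this episode. A full agent would solve the
        # sampled game for a maximin policy here; that step is omitted.
        sampled = sample_kernel(counts, rng)
        while t < horizon and not count_doubled(visits, visits_start):
            # Placeholder for the agent's action and the opponent's reply.
            action = rng.randrange(num_joint_actions)
            next_state = rng.choices(
                range(num_states), weights=true_kernel[state][action]
            )[0]
            counts[state][action][next_state] += 1
            visits[state][action] += 1
            state = next_state
            t += 1
    return episodes, sampled


if __name__ == "__main__":
    episodes, _ = run()
    print("episodes used:", episodes)
```

Because a count must more than double to close an episode, the number of episodes in this sketch grows only logarithmically in the horizon for each (state, joint action) pair, so the posterior is resampled rarely relative to the number of steps.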

Abstract

In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $O(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions, and $T$ is the horizon. We consider the online setting where the opponent cannot be controlled and can follow an arbitrary time-adaptive, history-dependent strategy. Our regret bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.
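As a rough check on the size of the improvement, the two bounds stated above can be compared directly. This is a back-of-the-envelope calculation that ignores constants and logarithmic factors; it is not a statement taken from the paper.

$$
\frac{HS\sqrt{AT}}{\sqrt[3]{DS^2AT^2}}
= H\,D^{-1/3}\,S^{1/3}\,A^{1/6}\,T^{-1/6}.
$$

The ratio falls below 1 once $T > H^6 S^2 A / D^2$, so under this simplification the $\sqrt{T}$ rate dominates the $T^{2/3}$ rate for all sufficiently long horizons.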

Paper Structure

This paper contains 10 sections, 4 theorems, 38 equations, 1 algorithm.

Key Result

Theorem 3.1

Under the finite-diameter assumption, the posterior sampling algorithm (PSRL-ZSG) achieves a Bayesian regret bound of $\widetilde{O}(HS\sqrt{AT})$.

Theorems & Definitions (8)

  • Theorem 3.1
  • Lemma 4.1
  • proof
  • Lemma 4.2
  • proof
  • Remark 4.1
  • Lemma 4.3
  • proof