ε-Optimally Solving Two-Player Zero-Sum POSGs

Erwan Christian Escudie, Matthia Sabatelli, Olivier Buffet, Jilles Steeve Dibangoye

TL;DR

This work addresses the challenge of solving two-player zero-sum POSGs by introducing the first lossless reduction to transition-independent zs-SGs, enabling principled application of dynamic-programming techniques. The method relies on occupancy-set representations and a hierarchical planner framework (focal, uninformed, informed, and marginal planners) to preserve value and equilibrium structure while enabling DP over a structured state space. A key theoretical result is that the reduced model admits a minimax value and that its value function is uniformly continuous. A PBVI algorithm exploits both properties to compute $\varepsilon$-optimal strategies, with an explicit bound $\varepsilon \leq \dfrac{4m\delta}{(1-\gamma)^2} [1+(\ell+1)\gamma^{\ell+2}-(\ell+2)\gamma^{\ell+1}]$ depending on sample density $\delta$. Empirically, PBVI variants outperform HSVI and CFR+ on challenging benchmarks while scaling to horizons up to $\ell=10$, demonstrating the practical viability of transferring DP theory from SGs to partially observable settings. Overall, the paper provides a principled pathway for unifying solution methods across cooperative, competitive, and mixed-motive POSGs through occupancy-based reductions and a structured planning hierarchy.
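To get a feel for how the stated bound behaves, the sketch below evaluates its right-hand side numerically. It is an illustration only, not the paper's code: the function names are hypothetical, and the constant $m$ is passed in as a plain number because this summary does not define it. The second function relies on a purely algebraic identity, $\frac{1+(\ell+1)\gamma^{\ell+2}-(\ell+2)\gamma^{\ell+1}}{(1-\gamma)^2} = \sum_{t=0}^{\ell}(t+1)\gamma^{t}$, which gives an equivalent form that avoids dividing by $(1-\gamma)^2$.

```python
def pbvi_error_bound(m: float, delta: float, gamma: float, horizon: int) -> float:
    """Right-hand side of the TL;DR bound, in closed form.

    m       -- the constant appearing in the bound (treated as a given number here)
    delta   -- sample density of the point set
    gamma   -- discount factor, must satisfy 0 <= gamma < 1 for this form
    horizon -- the horizon, written ell in the text
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("closed form requires 0 <= gamma < 1")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    bracket = (
        1.0
        + (horizon + 1) * gamma ** (horizon + 2)
        - (horizon + 2) * gamma ** (horizon + 1)
    )
    return 4.0 * m * delta / (1.0 - gamma) ** 2 * bracket


def pbvi_error_bound_sum(m: float, delta: float, gamma: float, horizon: int) -> float:
    """Same quantity written as 4*m*delta * sum_{t=0}^{ell} (t+1) * gamma**t."""
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return 4.0 * m * delta * sum((t + 1) * gamma ** t for t in range(horizon + 1))


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formula.
    for delta in (0.1, 0.05, 0.025):
        print(delta, pbvi_error_bound(m=1.0, delta=delta, gamma=0.9, horizon=10))
```

Both forms are linear in $\delta$, so halving the sample density parameter halves the guaranteed error for fixed $m$, $\gamma$, and $\ell$.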

Abstract

We present a novel framework for ε-optimally solving two-player zero-sum partially observable stochastic games (zs-POSGs). These games pose a major challenge due to the absence of a principled connection with dynamic programming (DP) techniques developed for two-player zero-sum stochastic games (zs-SGs). Prior attempts at transferring solution methods have lacked a lossless reduction, defined here as a transformation that preserves value functions, equilibrium strategies, and optimality structure, thereby limiting generalisation to ad-hoc algorithms. This work introduces the first lossless reduction from zs-POSGs to transition-independent zs-SGs, enabling the principled application of a broad class of DP-based methods. We show empirically that point-based value iteration (PBVI) algorithms, applied via this reduction, produce ε-optimal strategies across a range of benchmark domains, consistently matching or outperforming existing state-of-the-art methods. Our results open a systematic pathway for algorithmic and theoretical transfer from SGs to partially observable settings.

Paper Structure

This paper contains 39 sections, 24 theorems, 74 equations, 7 figures, 3 tables, 3 algorithms.

Key Result

Lemma 1

The reduced game $\mathcal{M}'$ admits a well-defined value $v'_{*}(b)$, which satisfies the minimax identity: $v'_{*}(b) = \min_{\psi_{2} \in \Psi_{2}} \max_{\psi_{1} \in \Psi_{1}} v'_{\psi_{\dots}}$ …

Figures (7)

  • Figure 1: A planner hierarchy induced by relaxing information constraints, from the focal to the marginal planner, supporting our theoretical and algorithmic framework.
  • Figure 2: An influence diagram of a transition-independent, two-player, zero-sum stochastic game.
  • Figure 3: Exploitability of PBVI$_k$ across iterations and runtime on Adversarial Tiger and Mabc ($\ell=5$), with CFR+ for comparison. Rightmost plot shows time-to-convergence for PBVI$_k$ on Mabc.
  • Figure 4: A graphical model of a two-player zero-sum partially observable stochastic game. Each triple $z \doteq (z_1, z_2, w)$ comprises private and public observations. The diagram illustrates an influence process over three stages: central nodes represent the hidden states $(s_t)$; the top and bottom rows show the private observations and actions of players 2 and 1, respectively. Observation nodes also include the local payoff: “$+$” denotes a gain for player 1, and “$-$” a loss for player 2. Directed edges indicate probabilistic dependencies: actions influence transitions and observations, while observations inform future actions. The shaded region highlights the hidden environment state from each player’s viewpoint, emphasising the decentralised and asymmetric information structure. This diagram captures the sequential, partially observable, and adversarial nature of zs-POSGs. The underlying dynamics decompose into two functions, the state transition matrices $\{p^{a}_{ss'}\}$ and the observation matrices $\{p^{az}_{s'}\}$, where $p(s',z|s,a) = p^{a}_{ss'}\cdot p^{az}_{s'}$.
  • Figure 5: Generalization across marginals of the value function given by a collection $G = \{\textcolor{sthlmBlue}{g_{\pmb{c}_{2}}}, \textcolor{sthlmRed}{g_{\pmb{c}_{2}}}, \textcolor{sthlmGreen}{g_{\pmb{c}_{2}}}\}$ of linear functions over unknown marginals. Figure A shows no generalization on marginal $\textcolor{sthlmOrange}{m_{2}}$ because $\textcolor{sthlmOrange}{m_{2}} \notin \{\textcolor{sthlmBlue}{m_{2}}, \textcolor{sthlmRed}{m_{2}}, \textcolor{sthlmGreen}{m_{2}}\}$, cf. Theorem \ref{thm:wiggers}. Figure B shows generalization over unknown marginal occupancy state $\textcolor{sthlmOrange}{m_{2}}$ from known marginal $\textcolor{sthlmBlue}{m_{2}}$ with offset $\kappa\|\textcolor{sthlmOrange}{x} - \textcolor{sthlmBlue}{\pmb{c}_{2}} \odot \textcolor{sthlmOrange}{m_{2}}\|_1$, cf. Theorem \ref{thm:delage}. Best viewed in color.
  • ...and 2 more figures
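
The factorisation stated in the Figure 4 caption, $p(s',z|s,a) = p^{a}_{ss'}\cdot p^{az}_{s'}$, can be made concrete with a small sketch. Everything below is an assumption made for illustration: the function names, the nested-list layout (`trans[a][s][s2]` for $p^{a}_{ss'}$ and `obs[a][s2][z]` for $p^{az}_{s'}$), and the toy numbers, none of which come from the paper's benchmarks. Here `a` indexes a joint action and `z` a joint observation.

```python
def joint_dynamics(trans, obs):
    """Combine transition and observation matrices into p(s', z | s, a).

    trans[a][s][s2] -- probability of moving from state s to s2 under joint action a
    obs[a][s2][z]   -- probability of joint observation z on reaching s2 under a
    Returns joint[a][s][s2][z] = trans[a][s][s2] * obs[a][s2][z].
    """
    return [
        [
            [
                [trans[a][s][s2] * obs[a][s2][z] for z in range(len(obs[a][s2]))]
                for s2 in range(len(trans[a][s]))
            ]
            for s in range(len(trans[a]))
        ]
        for a in range(len(trans))
    ]


def is_normalised(joint, tol=1e-9):
    """Check that, for every (a, s), the joint kernel sums to 1 over (s', z)."""
    return all(
        abs(sum(sum(row) for row in joint[a][s]) - 1.0) <= tol
        for a in range(len(joint))
        for s in range(len(joint[a]))
    )


# Toy model (made-up numbers): one joint action, two states, two observations.
toy_trans = [[[0.9, 0.1], [0.2, 0.8]]]
toy_obs = [[[0.85, 0.15], [0.15, 0.85]]]
toy_joint = joint_dynamics(toy_trans, toy_obs)
```

If each row of the transition and observation matrices is a probability distribution, the product kernel is automatically normalised over $(s', z)$, which `is_normalised` checks for the toy model.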

Theorems & Definitions (39)

  • Definition 2.1
  • Definition 3.1
  • Definition 3.2
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Corollary 1
  • ...and 29 more