ε-Optimally Solving Two-Player Zero-Sum POSGs
Erwan Christian Escudie, Matthia Sabatelli, Olivier Buffet, Jilles Steeve Dibangoye
TL;DR
This work addresses the challenge of solving two-player zero-sum partially observable stochastic games (zs-POSGs) by introducing the first lossless reduction to transition-independent zero-sum stochastic games (zs-SGs), enabling principled application of dynamic-programming (DP) techniques. The method relies on occupancy-set representations and a hierarchy of planners (focal, uninformed, informed, and marginal) to preserve value and equilibrium structure while enabling DP over a structured state space. A key theoretical result is that the reduced model admits a minimax value and a uniformly continuous value function; a point-based value iteration (PBVI) algorithm exploits this to compute $\varepsilon$-optimal strategies, with the explicit bound $\varepsilon \leq \dfrac{4m\delta}{(1-\gamma)^2} [1+(\ell+1)\gamma^{\ell+2}-(\ell+2)\gamma^{\ell+1}]$, which depends on the sample density $\delta$. Empirically, PBVI variants outperform HSVI and CFR+ on challenging benchmarks while scaling to horizons up to $\ell=10$, demonstrating the practical viability of transferring DP theory from SGs to partially observable settings. Overall, the paper provides a principled pathway for unifying solution methods across cooperative, competitive, and mixed-motive POSGs through occupancy-based reductions and a structured planning hierarchy.
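To make the error bound concrete, the following minimal Python sketch evaluates its right-hand side. It relies on the standard identity $\dfrac{1+(\ell+1)\gamma^{\ell+2}-(\ell+2)\gamma^{\ell+1}}{(1-\gamma)^2} = \sum_{k=1}^{\ell+1} k\,\gamma^{k-1}$ for $0 \le \gamma < 1$, so the bound can equivalently be read as $4m\delta \sum_{k=1}^{\ell+1} k\,\gamma^{k-1}$. The function names and the sample parameter values are illustrative assumptions; the constant $m$ is taken as given, since this summary does not define it.

```python
import math


def epsilon_bound(m: float, delta: float, gamma: float, horizon: int) -> float:
    """Closed-form right-hand side of the bound quoted in the TL;DR.

    m       : constant from the bound (its definition is not given here; assumed positive)
    delta   : sample density of the point set
    gamma   : discount factor, required to lie in [0, 1)
    horizon : planning horizon, written l in the text
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the closed form requires 0 <= gamma < 1")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    bracket = (
        1.0
        + (horizon + 1) * gamma ** (horizon + 2)
        - (horizon + 2) * gamma ** (horizon + 1)
    )
    return 4.0 * m * delta / (1.0 - gamma) ** 2 * bracket


def epsilon_bound_sum(m: float, delta: float, gamma: float, horizon: int) -> float:
    """Same quantity written as the finite sum 4*m*delta * sum_{k=1}^{l+1} k * gamma^(k-1)."""
    return 4.0 * m * delta * sum(k * gamma ** (k - 1) for k in range(1, horizon + 2))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper's experiments.
    m, delta, gamma, horizon = 1.0, 0.01, 0.9, 10
    closed = epsilon_bound(m, delta, gamma, horizon)
    summed = epsilon_bound_sum(m, delta, gamma, horizon)
    print(f"closed form: {closed:.6f}")
    print(f"finite sum : {summed:.6f}")
    print("forms agree:", math.isclose(closed, summed, rel_tol=1e-9))
```

The sum form shows that the bound is linear in $\delta$, so halving the sample density parameter halves the guaranteed error, and that it grows with the horizon $\ell$ as more discounted terms are added.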
Abstract
We present a novel framework for ε-optimally solving two-player zero-sum partially observable stochastic games (zs-POSGs). These games pose a major challenge because they lack a principled connection to the dynamic programming (DP) techniques developed for two-player zero-sum stochastic games (zs-SGs). Prior attempts at transferring solution methods have lacked a lossless reduction, defined here as a transformation that preserves value functions, equilibrium strategies, and optimality structure, leaving existing approaches ad hoc and hard to generalise. This work introduces the first lossless reduction from zs-POSGs to transition-independent zs-SGs, enabling the principled application of a broad class of DP-based methods. We show empirically that point-based value iteration (PBVI) algorithms, applied via this reduction, produce ε-optimal strategies across a range of benchmark domains, consistently matching or outperforming existing state-of-the-art methods. Our results open a systematic pathway for algorithmic and theoretical transfer from SGs to partially observable settings.
