The Value Problem for Multiple-Environment MDPs with Parity Objective
Krishnendu Chatterjee, Laurent Doyen, Jean-François Raskin, Ocan Sankur
TL;DR
This work analyzes multiple-environment MDPs (MEMDPs) with parity objectives, where a single strategy must perform well in every environment. It establishes complexity bounds: the limit-sure problem (deciding whether the value is $1$) is PSPACE-complete in general and solvable in polynomial time when the number of environments is fixed, while the gap problem for approximating the value is solvable in double-exponential space. Pure strategies suffice for almost-sure (AS) and limit-sure (LS) winning, whereas randomization is necessary in general when the value is below $1$. The authors develop several techniques (revealed-form MEMDPs, purification of non-distinguishing end-components (purge), common end-components, and learning-by-playing) to derive new characterizations of AS and LS parity, construct finite-memory witnesses, and give an approximation algorithm for the gap problem. These results contrast with partially-observable MDPs (POMDPs), for which the corresponding problems are undecidable, and make MEMDPs a decidable model for reasoning robustly across several possible environments.
Abstract
We consider multiple-environment Markov decision processes (MEMDPs), which consist of a finite set of MDPs over the same state space, representing different scenarios of transition structure and probability. The value of a strategy is the probability of satisfying the objective, here a parity objective, in the worst-case scenario, and the value of an MEMDP is the supremum of the values achievable by a strategy. We show that deciding whether the value is 1 is a PSPACE-complete problem, and is even in P when the number of environments is fixed, and we give new insights into the almost-sure winning problem, which is to decide whether there exists a strategy with value 1. Pure strategies are sufficient for these problems, whereas randomization is necessary in general when the value is smaller than 1. We present an algorithm to approximate the value, running in double-exponential space. Our results are in contrast to the related model of partially-observable MDPs, where all these problems are known to be undecidable.
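
The worst-case definition of value, and the role of randomization when the value is below 1, can be illustrated on a toy instance. The sketch below is not taken from the paper: it assumes a hypothetical one-shot MEMDP with two environments, `env1` and `env2`, and two actions, `a` and `b`, where `a` reaches the target only in `env1` and `b` only in `env2`. Reachability of the target is used as a simple special case of a parity objective, and the names `WIN_PROB` and `strategy_value` are introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical two-environment example (an assumption, not from the paper).
# A single action is played once from the initial state:
#   WIN_PROB[env][action] = probability of reaching the target state.
WIN_PROB = {
    "env1": {"a": Fraction(1), "b": Fraction(0)},
    "env2": {"a": Fraction(0), "b": Fraction(1)},
}


def strategy_value(p_a):
    """Worst-case winning probability when 'a' is played with probability p_a."""
    p_a = Fraction(p_a)
    return min(
        p_a * probs["a"] + (1 - p_a) * probs["b"]
        for probs in WIN_PROB.values()
    )


# The two pure strategies: always 'b' (p_a = 0) and always 'a' (p_a = 1).
pure_values = [strategy_value(0), strategy_value(1)]

# Search a grid of randomized strategies for the best worst-case value.
grid = [Fraction(k, 100) for k in range(101)]
best_p = max(grid, key=strategy_value)

print("pure strategy values:", pure_values)
print("best randomized strategy:", best_p, "with value", strategy_value(best_p))
```

In this instance each pure strategy wins in one environment and loses in the other, so its worst-case value is 0, while mixing the two actions uniformly guarantees probability 1/2 in both environments. The grid search is only a convenience for this one-parameter example and is unrelated to the approximation algorithm of the paper.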
