Solving Long-run Average Reward Robust MDPs via Stochastic Games
Krishnendu Chatterjee, Ehsan Kafshdar Goharshady, Mehrdad Karrabi, Petr Novotný, Đorđe Žikelić
TL;DR
The work addresses long-run average reward robust MDPs with polytopic uncertainty by establishing a linear-time reduction to finite turn-based stochastic games (TBSGs), enabling transfer of algorithmic and complexity results. Leveraging this reduction, the authors prove that the threshold decision problem lies in NP ∩ coNP and that these RMDPs admit a randomized algorithm with sub-exponential expected runtime. They also present RPPI, a novel policy iteration method for solving these RMDPs without restrictive structural assumptions. RPPI combines discounted-sum TBSG policy iteration with PPE checks to guarantee optimal long-run average policies, and it shows substantial practical gains over existing value-iteration approaches. The methodology extends to nonunichain and multichain models, and a public implementation is provided to facilitate adoption. These results advance both the theoretical understanding of, and scalable computation for, robust sequential decision making under transition uncertainty.
Abstract
Markov decision processes (MDPs) provide a standard framework for sequential decision making under uncertainty. However, MDPs do not take uncertainty in transition probabilities into account. Robust Markov decision processes (RMDPs) address this shortcoming of MDPs by assigning to each transition an uncertainty set rather than a single probability value. In this work, we consider polytopic RMDPs in which all uncertainty sets are polytopes and study the problem of solving long-run average reward polytopic RMDPs. We present a novel perspective on this problem and show that it can be reduced to solving long-run average reward turn-based stochastic games with finite state and action spaces. This reduction allows us to derive several important consequences that were hitherto not known to hold for polytopic RMDPs. First, we derive new computational complexity bounds for solving long-run average reward polytopic RMDPs, showing for the first time that the threshold decision problem for them is in $\mathsf{NP} \cap \mathsf{coNP}$ and that they admit a randomized algorithm with sub-exponential expected runtime. Second, we present Robust Polytopic Policy Iteration (RPPI), a novel policy iteration algorithm for solving long-run average reward polytopic RMDPs. Our experimental evaluation shows that RPPI is much more efficient at solving long-run average reward polytopic RMDPs than state-of-the-art methods based on value iteration.
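To give some intuition for why polytopic uncertainty sets pair naturally with games that have finite action spaces, the sketch below computes a one-step worst-case expectation over a polytope given by its vertices. It relies only on the standard fact that a linear function over a polytope attains its minimum at a vertex, so an adversary choosing a transition distribution can be restricted to finitely many choices. This is an illustration under stated assumptions, not the paper's reduction or the RPPI algorithm: the vertex representation, the function names `expectation` and `adversary_best_vertex`, and the numbers are all hypothetical.

```python
from typing import Sequence, Tuple

Distribution = Sequence[float]


def expectation(dist: Distribution, values: Sequence[float]) -> float:
    """Expected successor value under one transition distribution."""
    return sum(p * v for p, v in zip(dist, values))


def adversary_best_vertex(
    vertices: Sequence[Distribution], values: Sequence[float]
) -> Tuple[int, float]:
    """Return (index, value) of the vertex minimizing the expectation.

    Assumption: the uncertainty set is the convex hull of `vertices`.
    Since the expectation is linear in the distribution, checking the
    vertices is enough to find the minimum over the whole polytope.
    """
    scored = [(expectation(d, values), i) for i, d in enumerate(vertices)]
    worst_value, worst_index = min(scored)
    return worst_index, worst_value


# Hypothetical uncertainty set over two successor states.
vertices = [(0.5, 0.5), (0.75, 0.25)]
values = [0.0, 4.0]  # hypothetical values of the two successors

index, value = adversary_best_vertex(vertices, values)
print(index, value)  # prints "1 1.0"
```

In this toy instance the adversary's choice reduces to picking one of two vertices, which is the kind of finite choice a second player in a turn-based game can make.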
