Solving Long-run Average Reward Robust MDPs via Stochastic Games

Krishnendu Chatterjee; Ehsan Kafshdar Goharshady; Mehrdad Karrabi; Petr Novotný; Đorđe Žikelić

TL;DR

The work addresses long-run average reward robust MDPs with polytopic uncertainty by establishing a linear-time reduction to finite turn-based stochastic games, enabling transfer of algorithmic and complexity results. Leveraging this reduction, the authors prove that threshold problems lie in NP ∩ coNP and present subexponential randomized algorithms, plus a novel policy-iteration-based method, RPPI, for solving these RMDPs without restrictive structural assumptions. RPPI combines discounted-sum TBSG policy iteration with PPE checks to guarantee optimal long-run average policies and shows substantial practical gains over existing value-iteration approaches. The methodology extends to non-unichain (multichain) models, and a public implementation is provided to facilitate adoption. These results advance both theoretical understanding and scalable computation for robust sequential decision-making under transition uncertainty.

Abstract

Markov decision processes (MDPs) provide a standard framework for sequential decision making under uncertainty. However, MDPs do not take uncertainty in transition probabilities into account. Robust Markov decision processes (RMDPs) address this shortcoming of MDPs by assigning to each transition an uncertainty set rather than a single probability value. In this work, we consider polytopic RMDPs in which all uncertainty sets are polytopes and study the problem of solving long-run average reward polytopic RMDPs. We present a novel perspective on this problem and show that it can be reduced to solving long-run average reward turn-based stochastic games with finite state and action spaces. This reduction allows us to derive several important consequences that were hitherto not known to hold for polytopic RMDPs. First, we derive new computational complexity bounds for solving long-run average reward polytopic RMDPs, showing for the first time that the threshold decision problem for them is in $\mathrm{NP} \cap \mathrm{coNP}$ and that they admit a randomized algorithm with sub-exponential expected runtime. Second, we present Robust Polytopic Policy Iteration (RPPI), a novel policy iteration algorithm for solving long-run average reward polytopic RMDPs. Our experimental evaluation shows that RPPI is much more efficient in solving long-run average reward polytopic RMDPs compared to state-of-the-art methods based on value iteration.
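The reduction idea can be illustrated with a minimal Python sketch. It is not the paper's construction or the RPPI algorithm: it assumes every polytope is given by an explicit list of its vertices, attaches rewards to state-action pairs, and solves the resulting game with plain discounted-sum value iteration rather than a long-run average method. The encoding and the names `rmdp_to_tbsg`, `discounted_game_values`, and the toy model are all hypothetical. The sketch relies on one standard fact: an expectation is linear in the distribution, so its minimum over a polytope is attained at a vertex, which lets the adversary be modelled as a player with finitely many actions.

```python
from typing import Dict, List, Tuple

Dist = Dict[str, float]                      # next state -> probability
# Hypothetical toy encoding: rmdp[s][a] = (reward, vertices of the polytope)
RMDP = Dict[str, Dict[str, Tuple[float, List[Dist]]]]


def rmdp_to_tbsg(rmdp: RMDP):
    """Build a turn-based game: the agent (max) moves in the original states,
    the adversary (min) moves in one new state per (state, action) pair and
    picks a vertex of the corresponding uncertainty polytope."""
    max_states = {}   # s -> list of (reward, min-state id)
    min_states = {}   # (s, a) -> list of vertex distributions
    for s, actions in rmdp.items():
        max_states[s] = []
        for a, (reward, vertices) in actions.items():
            min_states[(s, a)] = vertices
            max_states[s].append((reward, (s, a)))
    return max_states, min_states


def discounted_game_values(max_states, min_states, gamma=0.9, tol=1e-10):
    """Discounted-sum value iteration on the game (an assumed stand-in solver)."""
    v = {s: 0.0 for s in max_states}
    while True:
        new_v = {}
        for s, moves in max_states.items():
            new_v[s] = max(
                r + gamma * min(
                    sum(p * v[t] for t, p in dist.items())
                    for dist in min_states[m]
                )
                for r, m in moves
            )
        if max(abs(new_v[s] - v[s]) for s in v) < tol:
            return new_v
        v = new_v


# Toy model: "risky" pays more per step but may lead to the absorbing state s1.
toy: RMDP = {
    "s0": {
        "safe": (1.0, [{"s0": 1.0}]),
        "risky": (2.0, [{"s0": 0.9, "s1": 0.1}, {"s0": 0.5, "s1": 0.5}]),
    },
    "s1": {"wait": (0.0, [{"s1": 1.0}])},
}
max_states, min_states = rmdp_to_tbsg(toy)
values = discounted_game_values(max_states, min_states)
```

In this toy model the robust value of `s0` is that of always playing `safe`, because the adversary answers `risky` with the vertex that moves to `s1` with probability 0.5.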

Paper Structure (17 sections, 10 theorems, 33 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Preliminaries and Models
Reduction to Turn-based Stochastic Games
Background on Turn-based Stochastic Games
Reduction
Discussion and Implications
Algorithm for Long-run Average RMDPs
Experimental Results
Proof of Theorem 2
Step 1: Definition of $\Phi$
Step 1: Definition of $\Psi$
Step 2: Preservation of policy pair values
Step 3: One-to-one correspondence between pure positional policies
Step 4: Conclusion of theorem proof
Proof of Theorem 3
...and 2 more sections

Key Result

Theorem 1

Given a TBSG $\mathcal{G}$, the following equality holds for both long-run average and discounted-sum objectives:
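The displayed equality is not reproduced in this listing. As a hedged sketch, a standard formulation of pure positional determinacy for finite turn-based stochastic games is given below. The notation is assumed here and is not taken from the paper: $\mathrm{Payoff}(s,\sigma,\pi)$ is the expected payoff from state $s$ under a policy $\sigma$ of the maximizing player and a policy $\pi$ of the minimizing player, and $\Sigma^{PP}_{\max}$, $\Sigma^{PP}_{\min}$ are the sets of pure positional policies of the two players.

$$
\sup_{\sigma}\,\inf_{\pi}\ \mathrm{Payoff}(s,\sigma,\pi)
\;=\;
\inf_{\pi}\,\sup_{\sigma}\ \mathrm{Payoff}(s,\sigma,\pi)
\;=\;
\max_{\sigma \in \Sigma^{PP}_{\max}}\ \min_{\pi \in \Sigma^{PP}_{\min}}\ \mathrm{Payoff}(s,\sigma,\pi)
$$

The first equality says the game has a value from every state $s$; the second says that value is already achieved when both players are restricted to pure positional policies.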

Figures (1)

Figure 1: Runtime comparison on the Contamination Model.

Theorems & Definitions (15)

Theorem 1: Pure positional determinacy
Theorem 2: Correctness
Proof: Proof sketch, full proof in Appendix \ref{app:soundnessproof}
Theorem 3: Complexity, proof in Appendix \ref{app:complexityproof}
Corollary 1
Corollary 2
Corollary 3
Remark 1: Discounted polytopic RMDPs
Corollary 4
Theorem 4
...and 5 more
