Beyond discounted returns: Robust Markov decision processes with average and Blackwell optimality

Julien Grand-Clément, Marek Petrik, Nicolas Vieille

TL;DR

This paper extends robust MDP analysis beyond discounted objectives to average and Blackwell optimality, revealing a clear dichotomy between sa-rectangular and s-rectangular uncertainty. It proves that average optimal policies exist and can be chosen stationary/deterministic for sa-rectangular RMDPs, while for s-rectangular RMDPs average optimizers may fail to exist or require history-dependent strategies. It further shows ε-Blackwell optimal policies always exist for sa-rectangular models, and under definable uncertainty sets, Blackwell optimal policies also exist; however, Blackwell optimality may fail in s-rectangular models. The authors connect RMDPs with stochastic games, develop definability-based conditions, and propose several algorithms (including large-discount factor and value-iteration variants) to compute the optimal average value, with empirical validation on multiple testbeds. Overall, the work argues that distance-based sa-rectangular uncertainty models offer robust, practically tractable guarantees for average and Blackwell optimality, guiding practitioners toward more reliable RMDP formulations and solution methods.
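
The large-discount-factor idea mentioned above can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes a toy sa-rectangular RMDP in which each state-action pair has a small finite set of candidate transition distributions (a stand-in for the distance-based sets discussed in the paper), runs robust discounted value iteration for a discount factor close to 1, and uses the normalized value $(1-\gamma) v_\gamma$ as an approximation of the optimal average return. The function name `robust_value_iteration`, the variables `rewards` and `uncertainty`, and the two-state instance are all hypothetical and chosen only for illustration.

```python
import numpy as np


def robust_value_iteration(rewards, uncertainty, gamma, tol=1e-10, max_iter=1_000_000):
    """Robust discounted value iteration for a toy sa-rectangular RMDP.

    rewards[s, a]     : immediate reward for action a in state s.
    uncertainty[s][a] : list of candidate next-state distributions
                        (illustrative assumption: a finite set per pair).
    Returns the robust value vector and a greedy deterministic policy.
    """
    n_states, n_actions = rewards.shape
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        for s in range(n_states):
            for a in range(n_actions):
                # The adversary picks the worst candidate distribution for (s, a).
                worst = min(float(np.dot(p, v)) for p in uncertainty[s][a])
                q[s, a] = rewards[s, a] + gamma * worst
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=1)


# Hypothetical two-state instance.
# State 0: action 0 stays in state 0 with reward 1; action 1 moves to state 1 with reward 0.
# State 1: action 0 earns reward 2, but the adversary may keep the state or send it back to 0;
#          action 1 stays in state 1 with reward 0.
rewards = np.array([[1.0, 0.0],
                    [2.0, 0.0]])
uncertainty = [
    [[np.array([1.0, 0.0])], [np.array([0.0, 1.0])]],
    [[np.array([0.0, 1.0]), np.array([1.0, 0.0])], [np.array([0.0, 1.0])]],
]

gamma = 0.99
values, policy = robust_value_iteration(rewards, uncertainty, gamma)
gain_estimate = (1.0 - gamma) * values  # normalized values, a proxy for the average return
print(gain_estimate, policy)
```

In this toy instance the adversary always sends state 1 back to state 0, so the normalized values of both states approach 1 as $\gamma \to 1$, and the greedy policy is stationary and deterministic, in line with the sa-rectangular results summarized above. How close to 1 the discount factor must be in general is instance-dependent and is not addressed by this sketch.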

Abstract

Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the discounted return, but little is known about average optimality (optimizing the long-run average of the rewards obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently close to $1$). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs but, perhaps surprisingly, we show that for s-rectangular RMDPs average optimal policies may not exist, and if they exist, may need to be history-dependent (Markovian). We also study Blackwell optimality for sa-rectangular RMDPs, where we show that $\epsilon$-Blackwell optimal policies always exist, although Blackwell optimal policies may not exist. We also provide a sufficient condition for their existence, which encompasses virtually all examples from the literature. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games. Overall, our paper emphasizes the superior practical properties of distance-based sa-rectangular models over s-rectangular models for average and Blackwell optimality.
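
The two criteria described in words above can be written out as follows. This is a sketch using one common convention, not necessarily the paper's exact definitions: the symbols $R_{\gamma}$, $r_{s_t a_t}$, and $\gamma_0$ are notation assumed here, $R_{{\sf avg}}$ and $\mathcal{U}$ follow the figure captions and Proposition 2.2 below, and the paper may use a different limit (for instance $\limsup$) in the average return.

```latex
% Discounted and average returns of a policy \pi under a transition kernel P
R_{\gamma}(\pi, \bm{P}) = \mathbb{E}^{\pi, \bm{P}}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{s_t a_t} \right],
\qquad
R_{{\sf avg}}(\pi, \bm{P}) = \liminf_{T \to \infty} \mathbb{E}^{\pi, \bm{P}}\left[ \frac{1}{T+1} \sum_{t=0}^{T} r_{s_t a_t} \right].

% Robust average optimality: maximize the worst-case average return
\sup_{\pi} \inf_{\bm{P} \in \mathcal{U}} R_{{\sf avg}}(\pi, \bm{P}).

% Blackwell optimality: \pi^{\star} stays robust discount optimal for all \gamma close enough to 1
\exists\, \gamma_0 \in (0,1) \text{ such that } \forall\, \gamma \in (\gamma_0, 1): \quad
\inf_{\bm{P} \in \mathcal{U}} R_{\gamma}(\pi^{\star}, \bm{P}) = \sup_{\pi} \inf_{\bm{P} \in \mathcal{U}} R_{\gamma}(\pi, \bm{P}).
```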

Paper Structure

This paper contains 76 sections, 42 theorems, 115 equations, 9 figures, 1 table, 5 algorithms.

Key Result

Proposition 2.2

Let $\mathcal{U}$ be a convex compact s-rectangular uncertainty set. Then

Figures (9)

  • Figure 1: A simple robust MDP instance where $\inf_{\bm{P} \in \mathcal{U}} R_{{\sf avg}}(\pi,\bm{P})$ is not attained.
  • Figure 2: Transitions and rewards for the MDP instance for Proposition . The adversary chooses $p \in [0,1]$ and the decision-maker chooses an action in $\{T,B\}$.
  • Figure 3: Transitions and rewards for the Big Match (Blackwell and Ferguson, 1968; Gillette, 1957), reformulated as an s-rectangular RMDP. The adversary chooses $p \in [0,1]$ and the decision-maker chooses an action in $\{T,B\}$.
  • Figure 4: Transitions and rewards for an s-rectangular RMDP instance with no Blackwell optimal policy.
  • Figure 5: Errors to the optimal gain for a box uncertainty set.
  • ...and 4 more figures

Theorems & Definitions (61)

  • Remark 2.1: History-dependent adversaries
  • Proposition 2.2
  • Example 3.1
  • Proposition 3.2
  • Lemma 3.3: Adapted from Theorem 2.5 of Bierth (1987)
  • Theorem 3.4
  • Theorem 3.5
  • Theorem 3.6
  • Corollary 3.7
  • Proposition 3.8
  • ...and 51 more