Best-Effort Policies for Robust Markov Decision Processes

Alessandro Abate, Thom Badings, Giuseppe De Giacomo, Francesco Fabiano

TL;DR

The paper proves that optimal robust best-effort (ORBE) policies always exist, characterizes their structure, and presents an algorithm that computes them with manageable overhead compared to standard robust value iteration.

Abstract

We study the common generalization of Markov decision processes (MDPs) with sets of transition probabilities, known as robust MDPs (RMDPs). A standard goal in RMDPs is to compute a policy that maximizes the expected return under an adversarial choice of the transition probabilities. If the uncertainty in the probabilities is independent between the states, known as s-rectangularity, such optimal robust policies can be computed efficiently using robust value iteration. However, there might still be multiple optimal robust policies, which, while equivalent with respect to the worst case, yield different expected returns under non-adversarial choices of the transition probabilities. Hence, we propose a refined policy selection criterion for RMDPs, drawing inspiration from the notions of dominance and best-effort in game theory. Instead of seeking a policy that only maximizes the worst-case expected return, we additionally require the policy to achieve a maximal expected return under different (i.e., not fully adversarial) transition probabilities. We call such a policy an optimal robust best-effort (ORBE) policy. We prove that ORBE policies always exist, characterize their structure, and present an algorithm to compute them with a manageable overhead compared to standard robust value iteration. ORBE policies offer a principled tie-breaker among optimal robust policies. Numerical experiments show the feasibility of our approach.
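To make the motivation concrete, the following is a minimal sketch of robust value iteration on a hypothetical two-state RMDP. It is not the paper's algorithm, and it is not the RMDP from Figure 1. It assumes a simpler setting than the abstract's s-rectangular one: uncertainty is independent per state-action pair, each uncertainty set is a finite list of candidate distributions, and returns are discounted with a factor `gamma`. All names (`robust_value_iteration`, `P_sets`, `ties`) are illustrative. The example only shows that two actions can tie in the worst case even though one of them does better under a non-adversarial candidate, which is the situation a tie-breaking criterion such as ORBE is meant to resolve.

```python
import numpy as np


def robust_value_iteration(R, P_sets, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Robust value iteration under the assumptions stated above.

    R:      array of shape (S, A) with rewards.
    P_sets: array of shape (S, A, K, S); K candidate next-state
            distributions for every state-action pair (assumed finite set).
    Returns the robust values V (S,) and robust action values Q (S, A).
    """
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # Expected next value for every candidate distribution: (S, A, K).
        next_vals = P_sets @ V
        # The adversary picks the worst candidate for each (s, a).
        Q = R + gamma * next_vals.min(axis=2)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P_sets @ V).min(axis=2)
    return V, Q


# Hypothetical toy RMDP: state 1 is absorbing and pays reward 1 per step.
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
P_sets = np.array([
    [  # state 0
        [[1.0, 0.0], [0.5, 0.5]],  # action 0: may reach state 1 with prob. 0.5
        [[1.0, 0.0], [1.0, 0.0]],  # action 1: always stays in state 0
    ],
    [  # state 1
        [[0.0, 1.0], [0.0, 1.0]],
        [[0.0, 1.0], [0.0, 1.0]],
    ],
])

V, Q = robust_value_iteration(R, P_sets)
# Actions whose worst-case value matches the robust optimum in each state.
ties = np.isclose(Q, Q.max(axis=1, keepdims=True))
```

In state 0 both actions are flagged in `ties`, because the adversary can keep the process in state 0 either way. Under the second candidate distribution, however, action 0 reaches the rewarding state with probability 0.5 while action 1 never does, so the two worst-case-equivalent actions differ once the transition probabilities are not fully adversarial.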

Paper Structure

This paper contains 42 sections, 14 theorems, 37 equations, 8 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

For any RMDP, the intersection of the sets of optimal robust policies $\Pi^\star$ and best-effort policies $\Pi_\mathrm{BE}$ is nonempty.

Figures (8)

  • Figure 1: Left: An RMDP with two states, where the policy is fully defined by the probability $\beta \coloneqq \pi(s_1,a_1)$ of choosing $a_1$ in $s_1$. The reward function is defined as $R(s_1,a_1) = R(s_1,a_2) = 0$ and $R(s_2, a) = 1$. Right: The expected return $\rho^\beta_\xi$ as a function of $\beta$ and $\xi \in [0,0.5]$. All policies are optimal robust, but only the policy with $\beta = 0$ is best-effort.
  • Figure 2: The value function $Z^\pi_{P,s_1}$ in state $s_1$ for the RMDP from Figure 1, shown for the policies with $\beta=1$ (left) and $\beta=0$ (right). The curved lines show the expected return as the parameter $\xi$ in Figure 1 ranges from $0$ to $0.5$ (the line markers correspond with those on the $\xi$-axis in Figure 1).
  • Figure 3: The directional derivative $\nabla_\mathbf{v} Z^\pi_{P,s_1}$ for $\beta = 0$ (shown in the right half) is strictly larger than for any $\beta > 0$. Hence, we conclude that the policy for $\beta = 0$ is $\mathrm{ORBE}$.
  • Figure 4: Structure of the policy space in an RMDP. The gray ellipse represents the set of all policies admissible in the RMDP. The orange region denotes the set of optimal robust policies ($\Pi^\star$), while the blue region indicates the set of best-effort policies ($\Pi_\mathrm{BE}$). The area where the two regions overlap corresponds to the $\mathrm{ORBE}$ policies ($\Pi^\star \cap \Pi_\mathrm{BE} = \Pi^\star_\mathrm{BE}$).
  • Figure 5: Visualization of the proof of \ref{thm:char:complete} for a convex polytopic uncertainty set $\mathcal{P}_{\bar{s}}$ over three states. The line segment between $P^{(1)}_{\bar{s}}$ and $P^{(2)}_{\bar{s}}$ is shown in red. The color shade in the polytope depicts the difference $L^{\pi^\star,\pi'}_{P,\bar{s}}(x)$ in value between the policies $\pi^\star$ and $\pi'$. Red means $L^{\pi^\star,\pi'}_{P,\bar{s}}(x) < 0$, white means $L^{\pi^\star,\pi'}_{P,\bar{s}}(x) = 0$, and green means $L^{\pi^\star,\pi'}_{P,\bar{s}}(x) > 0$. Because $L^{\pi^\star,\pi'}_{P,\bar{s}}(x)$ is zero along the line segment, which intersects the interior of the uncertainty set $\mathcal{P}_{\bar{s}}$, for every point $P'$ where $L^{\pi^\star,\pi'}_{P,\bar{s}}(x) < 0$ ($\pi'$ outperforms $\pi^\star$), there exists another point $P''$ where $L^{\pi^\star,\pi'}_{P,\bar{s}}(x) > 0$ ($\pi'$ performs worse than $\pi^\star$). In particular, this point $P'' \in \mathcal{P}_{\bar{s}}$ can be chosen to be any point such that the line through $P'$ and $P''$ is perpendicular to the line through $P^{(1)}_{\bar{s}}$ and $P^{(2)}_{\bar{s}}$.
  • ...and 3 more figures

Theorems & Definitions (36)

  • Definition 1: MDP
  • Definition 2: RMDP
  • Definition 3: Rectangularity
  • Definition 4: Dominance
  • Definition 5: Strict dominance
  • Definition 6: Best-effort
  • Remark 1
  • Definition 7: Partial transition function
  • Definition 8: Parametric value function
  • Example 1
  • ...and 26 more