Risk-averse optimization of total rewards in Markovian models using deviation measures

Christel Baier; Jakob Piribauer; Maximilian Starke

TL;DR

The paper addresses risk-averse optimization of accumulated rewards in Markov decision processes by maximizing $\mathbb{E}^{\mathfrak{S}}(\mathit{rew}) - \lambda \cdot \mathrm{DEV}^{\mathfrak{S}}(\mathit{rew})$ for several deviation measures $\mathrm{DEV}$. The resulting objectives are MADPE, SMADPE, and SVPE, complemented by the threshold-based variant TBPE. The paper shows that MADPE can yield desirable, eventually reward-maximizing optimal schedulers when $\lambda \le \tfrac{1}{2}$, and it provides a quadratic-program formulation on an unfolded model to compute the optimum (with an EXPSPACE upper bound for the threshold problem and PP-hardness even for acyclic Markov chains). For SVPE, however, optimal schedulers can still be eventually reward-minimizing (ERMin) and randomization may be necessary, which indicates that SVPE may not resolve the drawback observed for VPE. The paper also introduces a polynomial-time TBPE optimization via unfolding and reports prototypical experiments with PRISM and Gurobi that demonstrate practical feasibility on sizable models.

Abstract

This paper addresses objectives tailored to the risk-averse optimization of accumulated rewards in Markov decision processes (MDPs). The studied objectives require maximizing the expected value of the accumulated rewards minus a penalty factor times a deviation measure of the resulting distribution of rewards. Using the variance in this penalty mechanism leads to the variance-penalized expectation (VPE) for which it is known that optimal schedulers have to minimize future expected rewards when a high amount of rewards has been accumulated. This behavior is undesirable as risk-averse behavior should keep the probability of particularly low outcomes low, but not discourage the accumulation of additional rewards on already good executions. The paper investigates the semi-variance, which only takes outcomes below the expected value into account, the mean absolute deviation (MAD), and the semi-MAD as alternative deviation measures. Furthermore, a penalty mechanism that penalizes outcomes below a fixed threshold is studied. For all of these objectives, the properties of optimal schedulers are specified and in particular the question whether these objectives overcome the problem observed for the VPE is answered. Further, the resulting algorithmic problems on MDPs and Markov chains are investigated.
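To make the objective concrete, the following minimal sketch evaluates "expected value minus $\lambda$ times a deviation measure" on a small, hand-picked reward distribution. It is an illustration only: the function names, the example distribution, and the textbook definitions of variance, semi-variance, MAD, and semi-MAD used here are assumptions, and the paper's own definitions may differ in normalization. It works on a fixed distribution and does not solve any MDP.

```python
from fractions import Fraction

# A reward distribution is a dict {outcome: probability} (hypothetical format).

def expectation(dist):
    return sum(p * x for x, p in dist.items())

def variance(dist):
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def semi_variance(dist):
    # Only outcomes below the expected value contribute.
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items() if x < mu)

def mad(dist):
    # Mean absolute deviation from the expected value.
    mu = expectation(dist)
    return sum(p * abs(x - mu) for x, p in dist.items())

def semi_mad(dist):
    # Absolute deviation restricted to outcomes below the expected value.
    mu = expectation(dist)
    return sum(p * (mu - x) for x, p in dist.items() if x < mu)

def penalized_expectation(dist, dev, lam):
    # Expected value minus lam times the chosen deviation measure.
    return expectation(dist) - lam * dev(dist)

# Example distribution (an assumption for illustration, not from the paper).
dist = {0: Fraction(1, 4), 4: Fraction(1, 2), 8: Fraction(1, 4)}
lam = Fraction(1, 2)

for name, dev in [("variance", variance), ("semi-variance", semi_variance),
                  ("MAD", mad), ("semi-MAD", semi_mad)]:
    print(name, dev(dist), penalized_expectation(dist, dev, lam))
```

Under these definitions, the deviations above and below the expected value balance, so the semi-MAD is always half of the MAD; the semi-variance and the variance have no such fixed relationship in general.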

Paper Structure (16 sections, 19 theorems, 14 equations, 4 figures, 1 table)


Introduction
Preliminaries
Mean absolute deviation-penalized expectation
Randomization and optimality of ERMin-schedulers
Sufficiently small parameters $\lambda$
Computing the maximal MADPE
Computational hardness of the MADPE
Semi-deviation measure-penalized expectation
Threshold-based penalty
Prototypical implementation and first experiments
Conclusion
Omitted proofs of the section "Preliminaries"
Omitted proofs and calculations of the section "Mean absolute deviation-penalized expectation"
Computations omitted in the section "Semi-deviation measure-penalized expectation"
Omitted proofs of the section "Threshold-based penalty"
...and 1 more sections

Key Result

Lemma 0

Let $\mathcal{M} = (S, \mathit{Act}, P, s_{\mathit{init}}, \mathit{rew}, \mathit{goal})$ be an MDP satisfying Assumption 1. Then, for any scheduler $\mathfrak{S}$ there is a reward-based scheduler $\mathfrak{T}$ such that the distribution of the random variable $\mathit{rew}$ is the same under $\mathfrak{S}$ and $\mathfrak{T}$.

Figures (4)

Figure 1: Two example MDPs.
Figure 2: Plot of MAD and variance over the expected value for schedulers obtained by choosing $\alpha$ with probability $p\in[0,1]$ in the MDP $\mathcal{M}$ of the randomization example.
Figure 3: Two example MDPs for phenomena of the SVPE.
Figure 4: Experimental evaluation of the algorithms for TBPE and MADPE.

Theorems & Definitions (25)

Lemma 0
Example 1
Example 2
Theorem 3
Remark 4
Lemma 4
Theorem 5
Lemma 5
Theorem 6
Theorem 7
...and 15 more
