Quantile Markov Decision Process

Xiaocheng Li, Huaiyang Zhong, Margaret L. Brandeau

TL;DR

This paper introduces Quantile Markov Decision Processes (QMDP), a framework for optimizing the quantiles of cumulative rewards in sequential decisions. By augmenting the state with a quantile-based statistic, it enables a dynamic programming approach that yields the optimal value for all quantiles in a single pass and extends to CVaR objectives. The paper provides theoretical characterizations, an efficient OPT-based algorithm, and complexity analysis, and demonstrates practical relevance through synthetic studies and an HIV treatment initiation case. The work offers distributional insight and risk-aware decision tools that complement traditional MDPs, with potential applicability across healthcare, finance, and service systems.

Abstract

The goal of a traditional Markov decision process (MDP) is to maximize expected cumulative reward over a defined horizon (possibly infinite). In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. In this paper we consider the problem of optimizing the quantiles of the cumulative rewards of a Markov decision process (MDP), which we refer to as a quantile Markov decision process (QMDP). We provide analytical results characterizing the optimal QMDP value function and present a dynamic programming-based algorithm to solve for the optimal policy. The algorithm also extends to the MDP problem with a conditional value-at-risk (CVaR) objective. We illustrate the practical relevance of our model by evaluating it on an HIV treatment initiation problem, where patients aim to balance the potential benefits and risks of the treatment.
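To make the difference between the two objectives concrete, the sketch below compares an expectation criterion with a quantile criterion on two made-up cumulative-reward distributions. It assumes the common convention $Q_\tau(X)=\inf\{x : \mathbb{P}(X\le x)\ge\tau\}$, which may differ in detail from the paper's own definition. The distributions `safe` and `risky` and the helper names are hypothetical and are not taken from the paper.

```python
def quantile(values, probs, tau):
    """Q_tau(X) = inf{x : P(X <= x) >= tau} for a discrete X.

    This is an assumed convention for illustration only.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    cumulative = 0.0
    pairs = sorted(zip(values, probs))
    for x, p in pairs:
        cumulative += p
        # Small tolerance guards against floating-point round-off.
        if cumulative >= tau - 1e-12:
            return x
    return pairs[-1][0]


def mean(values, probs):
    """Expected value of a discrete random variable."""
    return sum(v * p for v, p in zip(values, probs))


# Hypothetical cumulative-reward distributions induced by two policies.
safe = ([4.0, 5.0, 6.0], [0.25, 0.5, 0.25])
risky = ([0.0, 5.0, 20.0], [0.25, 0.5, 0.25])

for name, (vals, probs) in {"safe": safe, "risky": risky}.items():
    print(name, "mean:", mean(vals, probs),
          "Q_0.1:", quantile(vals, probs, 0.1),
          "Q_0.9:", quantile(vals, probs, 0.9))
```

In this toy setting an expectation-maximizing decision maker would pick `risky`, while one who optimizes the 0.1-quantile would pick `safe`. This is the kind of disagreement that motivates a quantile objective.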

Paper Structure

This paper contains 30 sections, 10 theorems, 54 equations, 11 figures, 2 algorithms.

Key Result

Lemma 1

For a given random variable $X$, $Q_\tau(X)$ is a left-continuous and non-decreasing function of $\tau$. Additionally,

Figures (11)

  • Figure 1: Comparison of MDP and QMDP value functions. Each plot is obtained from a different initialization of the model parameters. The red lines are the optimal quantile rewards computed via QMDP. The gray dashed lines are the empirical cumulative distribution functions of the rewards from simulations that execute the optimal MDP policy.
  • Figure 2: Illustration of the backward dynamic program for computing $v_t$ from $v_{t+1}$. Here $p_{i}^{(k)}=\mathbb{P}(S_{t+1}=s_i \mid S_t=s, a=a_k)$. Without loss of generality, the immediate reward $r_t(S_t=s, a_k)$ is ignored in the schematic.
  • Figure 3: QMDP optimal value function and optimal policy for two-period gambling game.
  • Figure 4: Step-by-step execution of Algorithm \ref{algoForOPT} with $n=3$ sample $g(\cdot,\cdot)$ functions. Numbers inside and along the blocks represent the values and breakpoints of the input functions $g(i,\cdot)$. The shaded regions reflect the progress of the algorithm. In the end, the output is $f$.
  • Figure 5: Illustration of the simple QMDP model.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Definition 1
  • Lemma 1
  • Theorem 1: Value Function Dynamic Programming
  • Lemma 2
  • Theorem 2: Optimal Value Function
  • Theorem 3: Optimal Policy
  • Proposition 1
  • Definition 2
  • Theorem 4: CVaR Value Function
  • Theorem 5
  • ...and 2 more