Quantile Markov Decision Process
Xiaocheng Li, Huaiyang Zhong, Margaret L. Brandeau
TL;DR
This paper introduces Quantile Markov Decision Processes (QMDP), a framework for optimizing the quantiles of cumulative rewards in sequential decisions. By augmenting the state with a quantile-based statistic, it enables a dynamic programming approach that yields the optimal value for all quantiles in a single pass and extends to CVaR objectives. The paper provides theoretical characterizations, an efficient OPT-based algorithm, and complexity analysis, and demonstrates practical relevance through synthetic studies and an HIV treatment initiation case. The work offers distributional insight and risk-aware decision tools that complement traditional MDPs, with potential applicability across healthcare, finance, and service systems.
Abstract
The goal of a traditional Markov decision process (MDP) is to maximize the expected cumulative reward over a defined horizon (possibly infinite). In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. In this paper we consider the problem of optimizing the quantiles of the cumulative reward of an MDP, which we refer to as a quantile Markov decision process (QMDP). We provide analytical results characterizing the optimal QMDP value function and present a dynamic programming-based algorithm to solve for the optimal policy. The algorithm also extends to the MDP problem with a conditional value-at-risk (CVaR) objective. We illustrate the practical relevance of our model by evaluating it on an HIV treatment initiation problem, where patients aim to balance the potential benefits and risks of the treatment.
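To make the contrast between an expectation objective and a quantile objective concrete, the following is a minimal sketch on a hypothetical one-step decision with two actions. The reward distributions, the action names, and the helper functions are illustrative assumptions and are not taken from the paper; the quantile uses the lower-quantile convention (the smallest x with P(X <= x) >= tau), which is also an assumption here.

```python
from typing import Dict

# A discrete reward distribution: reward value -> probability.
Dist = Dict[float, float]


def expectation(dist: Dist) -> float:
    """Expected reward of a discrete distribution."""
    return sum(x * p for x, p in dist.items())


def quantile(dist: Dist, tau: float) -> float:
    """Smallest x with P(X <= x) >= tau (lower-quantile convention, assumed)."""
    cumulative = 0.0
    for x in sorted(dist):
        cumulative += dist[x]
        if cumulative >= tau - 1e-12:  # small tolerance for float round-off
            return x
    return max(dist)


# Hypothetical actions: a risky one with a higher mean, a safe one with no variance.
actions: Dict[str, Dist] = {
    "risky": {0.0: 0.5, 10.0: 0.5},  # mean 5
    "safe": {4.0: 1.0},              # mean 4
}

best_mean = max(actions, key=lambda a: expectation(actions[a]))
best_q20 = max(actions, key=lambda a: quantile(actions[a], 0.2))
best_q80 = max(actions, key=lambda a: quantile(actions[a], 0.8))

print("best by expectation: ", best_mean)
print("best by 0.2-quantile:", best_q20)
print("best by 0.8-quantile:", best_q80)
```

In this toy case the expectation and the 0.8-quantile both favor the risky action, while the 0.2-quantile favors the safe one, which illustrates why the preferred action can depend on the quantile level being optimized. The sequential setting treated in the paper is harder than this single-step comparison, since the quantile of a cumulative reward does not decompose across stages the way an expectation does.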
