Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis

Jia Lin Hau, Erick Delage, Esther Derman, Mohammad Ghavamzadeh, Marek Petrik

TL;DR

This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm builds on a new, simple dynamic program (DP) decomposition for quantile MDPs.

Abstract

In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.

Paper Structure

This paper contains 43 sections, 29 theorems, 168 equations, 10 figures, 2 tables, 2 algorithms.

Key Result

Lemma 2.1

For any $\tilde{x}\in\mathbb{X}$ and $\alpha\in [0,1]$, it holds that $\mathop{\mathrm{argmin}}\limits_{y \in \mathbb{R}} \mathbb{E}[\ell_\alpha(\tilde{x}-y)] = [{\mathfrak{q}}^-_{\alpha}(\tilde{x}), {\mathfrak{q}}^+_{\alpha}(\tilde{x})] \cap \mathbb{R}.$
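Lemma 2.1 can be checked numerically on a small discrete random variable. The sketch below is illustrative only and rests on assumptions not stated in this summary: it takes $\ell_\alpha$ to be the standard pinball loss $\ell_\alpha(\delta) = \max\{\alpha\delta, (\alpha-1)\delta\}$, and takes ${\mathfrak{q}}^-_{\alpha}$ and ${\mathfrak{q}}^+_{\alpha}$ to be the usual lower and upper $\alpha$-quantiles. All function names are hypothetical, and the grid search stands in for the exact argmin.

```python
import numpy as np

def quantile_loss(delta, alpha):
    # Assumed pinball form: alpha*delta for delta >= 0, (alpha-1)*delta otherwise.
    return np.maximum(alpha * delta, (alpha - 1.0) * delta)

def expected_loss(values, probs, y, alpha):
    # E[l_alpha(x - y)] for a discrete random variable x.
    return float(np.dot(probs, quantile_loss(values - y, alpha)))

def lower_quantile(values, probs, alpha, tol=1e-12):
    # Smallest support point z with P(x <= z) >= alpha (assumed definition).
    order = np.argsort(values)
    cdf = np.cumsum(probs[order])
    for v, c in zip(values[order], cdf):
        if c >= alpha - tol:
            return float(v)
    return float(values[order][-1])

def upper_quantile(values, probs, alpha, tol=1e-12):
    # Largest z with P(x < z) <= alpha (assumed definition); for finite support
    # this is the first support point whose CDF strictly exceeds alpha.
    order = np.argsort(values)
    cdf = np.cumsum(probs[order])
    for v, c in zip(values[order], cdf):
        if c > alpha + tol:
            return float(v)
    return float("inf")  # happens only when alpha = 1

def grid_argmin_interval(values, probs, alpha, grid, tol=1e-9):
    # Endpoints of the set of grid points whose expected loss is within tol of the minimum.
    losses = np.array([expected_loss(values, probs, y, alpha) for y in grid])
    near = grid[losses <= losses.min() + tol]
    return float(near.min()), float(near.max())

# Toy example: x uniform on {1, 2, 3, 4}, alpha = 0.5.
values = np.array([1.0, 2.0, 3.0, 4.0])
probs = np.full(4, 0.25)
grid = np.linspace(0.0, 5.0, 501)

print(lower_quantile(values, probs, 0.5), upper_quantile(values, probs, 0.5))
print(grid_argmin_interval(values, probs, 0.5, grid))
```

For this toy distribution the expected loss is flat on $[2, 3]$, so the grid minimizers should span approximately the interval between the lower and upper quantiles, which is what the lemma predicts.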

Figures (10)

  • Figure 1: Policy performance INV2
  • Figure 1: $25\%$-quantile performance
  • Figure 2: Different discretization VaR MDP performance
  • Figure 3: Q-values of Algorithm \ref{alg:q-learning} for all small $\kappa$'s converge to the DP value function.
  • Figure 4: An example used to prove the non-uniqueness of an optimal solution in Proposition \ref{prop:must-strongly-convex}.
  • ...and 5 more figures

Theorems & Definitions (61)

  • Lemma 2.1
  • Theorem 3.1
  • Theorem 3.2
  • Definition 4.1
  • Lemma 4.2
  • Theorem 4.3
  • Example 4.4: $J$-uniform discretization
  • Proposition 4.5
  • Lemma 4.6
  • Theorem 4.8
  • ...and 51 more