Optimistic Reinforcement Learning with Quantile Objectives

Mohammad Alipour-Vaezi; Huaiyang Zhong; Kwok-Leung Tsui; Sajad Khodadadian

TL;DR

This work tackles risk-sensitive reinforcement learning by targeting the τ-quantile of the finite-horizon return in MDPs. It introduces UCB-QRL, a model-based optimistic algorithm that builds an ℓ1 confidence set over the transition kernel and uses a quantile-aware planner, supported by a continuation–mixture representation and a uniform quantile margin κ. The main theoretical contribution is a high-probability regret bound that depends explicitly on κ, the horizon, and the problem size, and that is sublinear in the number of episodes under the margin condition. The paper also discusses computational aspects, situates the approach within the risk-sensitive and UCB-based RL literature, and outlines open directions such as function approximation and extensions to infinite-horizon settings. Overall, the results give a principled form of tail-risk control for finite-horizon sequential decision-making and a starting point for data-efficient quantile optimization in safety-critical domains.
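To make the τ-quantile objective concrete, the sketch below compares two hypothetical sets of sampled returns by their mean and by their empirical τ-quantile. The lower-quantile convention used here (the smallest x with empirical CDF at least τ) and the sample data are assumptions for illustration; the paper's exact definition may differ.

```python
import math

def lower_quantile(samples, tau):
    """Smallest sample x whose empirical CDF satisfies F(x) >= tau."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    ordered = sorted(samples)
    # Index of the first order statistic with at least a tau fraction at or below it.
    k = math.ceil(tau * len(ordered)) - 1
    return ordered[max(k, 0)]

def mean(samples):
    return sum(samples) / len(samples)

# Hypothetical episode returns under two policies (illustrative data only).
returns_a = [0.0, 10.0, 10.0, 10.0, 10.0]  # higher mean, occasional bad outcome
returns_b = [5.0, 6.0, 6.0, 7.0, 7.0]      # lower mean, no bad outcome

tau = 0.1
print("mean:", mean(returns_a), mean(returns_b))
print("tau-quantile:", lower_quantile(returns_a, tau), lower_quantile(returns_b, tau))
```

In this toy data the first policy wins on expected return while the second wins on the 0.1-quantile, which is the kind of disagreement a quantile objective is meant to expose.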

Abstract

Reinforcement Learning (RL) has achieved tremendous success in recent years. However, the classical foundations of RL do not account for the risk sensitivity of the objective function, which is critical in fields such as healthcare and finance. A popular approach to incorporating risk sensitivity is to optimize a specific quantile of the cumulative reward distribution. In this paper, we develop UCB-QRL, an optimistic learning algorithm for the $\tau$-quantile objective in finite-horizon Markov decision processes (MDPs). UCB-QRL is an iterative algorithm in which, at each iteration, we first estimate the underlying transition probability and then optimize the quantile value function over a confidence ball around this estimate. We show that UCB-QRL yields a high-probability regret bound $\mathcal O\left((2/\kappa)^{H+1}H\sqrt{SATH\log(2SATH/\delta)}\right)$ in the episodic setting with $S$ states, $A$ actions, $T$ episodes, and horizon $H$. Here, $\kappa>0$ is a problem-dependent constant that captures the sensitivity of the underlying MDP's quantile value.
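The scaling of the stated bound can be explored numerically. The sketch below evaluates the expression inside the $\mathcal O(\cdot)$ with the hidden constant set to 1, which is an assumption, so the absolute values are not meaningful; only the trends are. The function name `regret_bound` and the parameter values are hypothetical.

```python
import math

def regret_bound(S, A, T, H, kappa, delta):
    """(2/kappa)^(H+1) * H * sqrt(SATH * log(2*SATH/delta)), hidden constant taken as 1."""
    if kappa <= 0 or not 0 < delta < 1:
        raise ValueError("need kappa > 0 and delta in (0, 1)")
    n = S * A * T * H
    return (2.0 / kappa) ** (H + 1) * H * math.sqrt(n * math.log(2.0 * n / delta))

# Illustrative problem sizes (not taken from the paper).
S, A, H, delta = 5, 2, 3, 0.05

for T in (10**4, 10**6):
    b = regret_bound(S, A, T, H, kappa=0.5, delta=delta)
    print(f"T={T}: bound={b:.3e}, per-episode={b / T:.3e}")

for kappa in (1.0, 0.5, 0.25):
    print(f"kappa={kappa}: bound={regret_bound(S, A, 10**4, H, kappa, delta):.3e}")
```

The per-episode value shrinks as $T$ grows, reflecting the $\sqrt{T\log T}$ dependence, while halving $\kappa$ multiplies the bound by $2^{H+1}$, so the guarantee degrades quickly for small margins or long horizons.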
