Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk

Sumedh Gupte, Shrey Rakeshkumar Patel, Soumen Pachal, Prashanth L. A., Sanjay P. Bhat

TL;DR

This work addresses risk-sensitive decision making in reinforcement learning by optimizing three convex risk measures—expectiles, utility-based shortfall risk (UBSR), and optimized certainty equivalents (OCE)—within finite-horizon MDPs. It develops policy-gradient theorems for each risk, constructs trajectory-based gradient estimators with non-asymptotic error bounds, and proves smoothness and convergence properties of a general risk-aware policy-gradient framework. The paper also introduces a practical RAPG algorithm and provides non-asymptotic convergence guarantees, then validates the theory with MuJoCo Reacher experiments showing improved performance and reduced variance compared to standard REINFORCE. Overall, it offers a unified, theoretically grounded methodology for risk-aware RL that covers multiple risk measures and demonstrates tangible gains on benchmark tasks.

Abstract

We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures, namely expectiles, utility-based shortfall risk and optimized certainty equivalent risk. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. Second, we propose estimators of the risk-sensitive policy gradient for each of the aforementioned risk measures, and establish $\mathcal{O}\left(1/m\right)$ mean-squared error bounds for our estimators, where $m$ is the number of trajectories. Further, under standard assumptions for policy gradient-type algorithms, we establish smoothness of the risk-sensitive objective, in turn leading to stationary convergence rate bounds for the overall risk-sensitive policy gradient algorithm that we propose. Finally, we conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
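For orientation, the three families named in the abstract are commonly defined as below. This is a sketch using textbook conventions for a random variable $X$, not the paper's own notation: the symbols $\nu$, $l$, $\lambda$, $u$, $t$ and $\eta$, and the choice of treating $X$ as a loss for shortfall risk and as a reward for OCE, are assumptions here, and the paper's sign conventions may differ.

```latex
% Expectile at level \nu \in (0,1): minimizer of an asymmetric squared loss
\xi_\nu(X) = \operatorname*{arg\,min}_{\xi \in \mathbb{R}}
  \mathbb{E}\!\left[ \nu \big((X-\xi)^+\big)^2 + (1-\nu)\big((\xi-X)^+\big)^2 \right]

% Utility-based shortfall risk with increasing convex loss function l and threshold \lambda
\mathrm{SR}_{l,\lambda}(X) = \inf\left\{ t \in \mathbb{R} : \mathbb{E}\big[\, l(X - t) \,\big] \le \lambda \right\}

% Optimized certainty equivalent with concave, nondecreasing utility u
\mathrm{OCE}_u(X) = \sup_{\eta \in \mathbb{R}} \left\{ \eta + \mathbb{E}\big[\, u(X - \eta) \,\big] \right\}
```

Under these conventions the expectile at $\nu = 1/2$ reduces to the mean $\mathbb{E}[X]$, and the first-order condition of the expectile problem is $\nu\,\mathbb{E}[(X-\xi)^+] = (1-\nu)\,\mathbb{E}[(\xi-X)^+]$.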

Paper Structure

This paper contains 42 sections, 31 theorems, 160 equations, 1 figure, 3 tables, 1 algorithm.

Key Result

Theorem 1

Suppose $\mathbb{P}\left( X = \xi_\nu \right)=0$ and $X$ has a finite second moment. Then, we have the following bound for the estimate $\hat{\xi}_{\nu}^m$ formed using the empirical expectile identification equation: […] In addition, if $X$ is sub-Gaussian (a random variable $X$ is $\sigma$-sub-Gaussian if $\mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \leq \exp(\frac{\lambda^2\sigma^2}{2})$ for all $\lambda \in \ma…
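The excerpt does not reproduce the identification equation behind $\hat{\xi}_{\nu}^m$. As a minimal sketch, assuming it is the standard sample version of the expectile first-order condition, $\nu \cdot \frac{1}{m}\sum_i (X_i - \xi)^+ = (1-\nu) \cdot \frac{1}{m}\sum_i (\xi - X_i)^+$, the estimate can be computed by bisection. The function name `empirical_expectile` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def empirical_expectile(samples, nu, tol=1e-10, max_iter=200):
    """Illustrative root-finder for the sample expectile at level nu.

    Assumes the standard identification equation
        nu * mean((x - xi)^+) = (1 - nu) * mean((xi - x)^+),
    which may differ in form from the equation used in the paper.
    """
    x = np.asarray(samples, dtype=float)
    if x.size == 0:
        raise ValueError("samples must be non-empty")
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie strictly between 0 and 1")

    def g(xi):
        # Non-increasing in xi; its root is the empirical nu-expectile.
        upper = np.mean(np.maximum(x - xi, 0.0))
        lower = np.mean(np.maximum(xi - x, 0.0))
        return nu * upper - (1.0 - nu) * lower

    # g(min) >= 0 and g(max) <= 0, so the root is bracketed by the sample range.
    lo, hi = float(x.min()), float(x.max())
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(loc=1.0, scale=2.0, size=1000)  # synthetic data, for illustration only
    for level in (0.1, 0.5, 0.9):
        print(level, empirical_expectile(rewards, level))
```

At `nu = 0.5` the two sides of the equation balance at the sample mean, which gives a quick sanity check; bisection is used here only because the left-minus-right side is monotone in `xi`, not because the paper prescribes it.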

Figures (1)

  • Figure 1: Performance of REINFORCE and four variants of RAPG with entropic risk, expectile, quadratic risk and mean-variance risk, respectively. The first subplot shows the average trajectory rewards, while the second subplot presents the trajectory reward distribution of the converged policies using $250$ independent episodes.

Theorems & Definitions (66)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 4
  • Lemma 1
  • ...and 56 more