Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk

Sumedh Gupte; Shrey Rakeshkumar Patel; Soumen Pachal; Prashanth L. A.; Sanjay P. Bhat

TL;DR

This work addresses risk-sensitive decision making in reinforcement learning by optimizing three convex risk measures—expectiles, utility-based shortfall risk (UBSR), and optimized certainty equivalents (OCE)—within finite-horizon MDPs. It develops policy-gradient theorems for each risk, constructs trajectory-based gradient estimators with non-asymptotic error bounds, and proves smoothness and convergence properties of a general risk-aware policy-gradient framework. The paper also introduces a practical RAPG algorithm and provides non-asymptotic convergence guarantees, then validates the theory with MuJoCo Reacher experiments showing improved performance and reduced variance compared to standard REINFORCE. Overall, it offers a unified, theoretically grounded methodology for risk-aware RL that covers multiple risk measures and demonstrates tangible gains on benchmark tasks.

Abstract

We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures, namely expectiles, utility-based shortfall risk and optimized certainty equivalent risk. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. Second, we propose estimators of the risk-sensitive policy gradient for each of the aforementioned risk measures, and establish $\mathcal{O}\left(1/m\right)$ mean-squared error bounds for our estimators, where $m$ is the number of trajectories. Further, under standard assumptions for policy gradient-type algorithms, we establish smoothness of the risk-sensitive objective, in turn leading to stationary convergence rate bounds for the overall risk-sensitive policy gradient algorithm that we propose. Finally, we conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
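As a rough illustration of one of the risk measures named above, the sketch below estimates utility-based shortfall risk from samples by bisection. It assumes the Föllmer-Schied convention, $SR(X) = \inf\{t : \mathbb{E}[l(-X - t)] \le \lambda\}$, with an increasing loss function $l$; the paper's own sign convention and estimator may differ. The function name `ubsr_bisection`, the bracket `[lo, hi]`, and the Gaussian test data are all hypothetical choices made for this example. With $l(x) = e^{\beta x}$ and $\lambda = 1$, this convention reduces to the entropic risk $\frac{1}{\beta}\log \mathbb{E}[e^{-\beta X}]$, which gives a closed form to compare against.

```python
import math
import random


def ubsr_bisection(samples, loss, lam, lo=-50.0, hi=50.0, tol=1e-10):
    """Sample-based UBSR: smallest t with mean(loss(-x - t)) <= lam.

    Assumes `loss` is increasing, so g(t) below is decreasing in t,
    and that the root lies inside the bracket [lo, hi].
    """
    def g(t):
        return sum(loss(-x - t) for x in samples) / len(samples) - lam

    if g(lo) < 0 or g(hi) > 0:
        raise ValueError("root not bracketed; widen [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid   # constraint violated at mid: the root is larger
        else:
            hi = mid   # constraint satisfied at mid: the root is at most mid
    return hi


# Hypothetical data: 2000 draws standing in for trajectory returns.
random.seed(0)
beta = 0.5
returns = [random.gauss(1.0, 1.0) for _ in range(2000)]

# Exponential loss with lam = 1 recovers the entropic risk of the sample.
estimate = ubsr_bisection(returns, lambda x: math.exp(beta * x), lam=1.0)
closed_form = math.log(sum(math.exp(-beta * x) for x in returns) / len(returns)) / beta
print(estimate, closed_form)
```

Bisection is used only because the sample constraint function is monotone in $t$; any bracketing root finder would serve equally well here.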

Paper Structure (42 sections, 31 theorems, 160 equations, 1 figure, 3 tables, 1 algorithm)

This paper contains 42 sections, 31 theorems, 160 equations, 1 figure, 3 tables, 1 algorithm.

Introduction
Preliminaries
Risk measures
Risk-sensitive RL with expectiles
Estimation of expectiles
Policy gradient for expectiles
Expectile gradient estimation.
Risk-sensitive RL with UBSR
Policy gradient theorem for UBSR.
Sample-based estimators of the UBSR gradient.
Risk-sensitive RL with OCE
Risk-sensitive policy gradient algorithm
Experiments
Conclusions
Table of risk measures
...and 27 more sections

Key Result

Theorem 1

Suppose $\mathbb{P}\left( X = \xi_\nu \right)=0$ and $X$ has a finite second moment. Then we have the following bound for $\hat{\xi}_{\nu}^m$ formed using the empirical expectile identification equation: … In addition, if $X$ is sub-Gaussian (a random variable $X$ is $\sigma$-sub-Gaussian if $\mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \leq \exp(\frac{\lambda^2\sigma^2}{2})$ for all $\lambda \in \mathbb{R}$), …
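The empirical identification equation referenced in the theorem is not reproduced in this excerpt. As a minimal sketch, the code below assumes the standard first-order condition for the $\nu$-expectile, $\nu\,\mathbb{E}[(X - \xi)^+] = (1 - \nu)\,\mathbb{E}[(\xi - X)^+]$, replaces the expectations with sample means over $m$ draws, and solves for $\xi$ by bisection. The function name `empirical_expectile` and the tolerance are hypothetical, and the paper's parameterization of $\nu$ may differ from the one assumed here.

```python
def empirical_expectile(samples, nu, tol=1e-12):
    """Solve nu * mean((x - xi)^+) = (1 - nu) * mean((xi - x)^+) for xi.

    The left side minus the right side is non-increasing in xi, is >= 0 at
    min(samples) and <= 0 at max(samples), so bisection on that interval works.
    """
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie strictly between 0 and 1")
    m = len(samples)

    def h(xi):
        upper = sum(max(x - xi, 0.0) for x in samples) / m
        lower = sum(max(xi - x, 0.0) for x in samples) / m
        return nu * upper - (1.0 - nu) * lower

    lo, hi = min(samples), max(samples)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid   # too much weighted mass above mid: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical sample; nu = 0.5 recovers the sample mean.
data = [0.2, 1.5, -0.7, 3.1, 0.9]
xi_low = empirical_expectile(data, 0.1)
xi_mid = empirical_expectile(data, 0.5)
xi_high = empirical_expectile(data, 0.9)
print(xi_low, xi_mid, xi_high)
```

At $\nu = 1/2$ the two sides of the condition carry equal weight and the solution is the sample mean, which makes a convenient sanity check; larger $\nu$ pushes the estimate toward the upper tail of the sample.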

Figures (1)

Figure 1: Performance of REINFORCE and four variants of RAPG with entropic risk, expectile, quadratic risk and mean-variance risk, respectively. The first subplot shows the average trajectory rewards, while the second subplot presents the trajectory reward distribution of the converged policies using $250$ independent episodes.

Theorems & Definitions (66)

Definition 1
Definition 2
Theorem 1
Theorem 2
Theorem 3
Remark 1
Remark 2
Remark 3
Theorem 4
Lemma 1
...and 56 more
