Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients

Parisa Davar; Frédéric Godin; Jose Garrido

TL;DR

This work tackles catastrophic tail risk in sequential decision making by optimizing CVaR$_{\alpha}$ with $\alpha$ near 1. It introduces POTPG, an EVT-based policy-gradient method that leverages peaks-over-threshold tail estimates to extrapolate the far tail and stabilize CVaR gradients, including automated threshold selection and a variance-reducing gradient procedure. The method is validated in controlled simulations and applied to dynamic option hedging under fat-tailed dynamics, where POTPG outperforms standard sample-averaging baselines, particularly when tail data are scarce. The results demonstrate that EVT-informed risk gradients can significantly improve tail-risk performance in reinforcement learning, with potential for extension to high-dimensional policies and deep RL.

Abstract

This paper addresses the mitigation of catastrophic risk (risk with very low frequency but very high severity) in sequential decision making. The problem is particularly challenging because observations in the far tail of the distribution of cumulative costs (negative rewards) are scarce. We develop a policy gradient algorithm, called POTPG, based on approximations of the tail risk derived from extreme value theory. Numerical experiments show that our method outperforms common benchmarks that rely on the empirical distribution. An application to financial risk management, namely the dynamic hedging of a financial option, is presented.

Paper Structure (11 sections, 2 theorems, 25 equations, 3 figures, 2 algorithms)

Introduction
A risk-aware reinforcement learning problem and policy gradients
A risk-aware reinforcement learning problem
A policy gradient solution approach
Integrating extreme value theory estimates into policy gradients
Estimation of CVaR with the peaks-over-threshold approach
Our proposed EVT policy gradient algorithm
Simulation experiments in a controlled environment
Application to financial hedging
The hedging framework
Conclusion

Key Result

Theorem 3.1

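If $F \in MDA(H_{\xi})$, there exists a positive measurable function $\sigma(u)$ such that $$\lim_{u \to y_0} \sup_{0 \leq y < y_0 - u} \left| F_u(y) - G_{\xi,\sigma(u)}(y) \right| = 0,$$ where $y_0 = \sup\{ y \in \mathbb{R} : F(y)<1\} \leq \infty$ is the right endpoint of $F$, $F_u(y) = P(Y - u \leq y \mid Y > u)$ is the excess distribution over the threshold $u$ for $Y \sim F$, and $G_{\xi,\sigma(u)}$ is the generalized Pareto distribution function with shape $\xi$ and scale $\sigma(u)$.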
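
To illustrate how this result is used in practice, the following is a minimal sketch of a peaks-over-threshold CVaR estimate next to the sample-averaging estimate. It is not the paper's POTPG procedure: the function names are hypothetical, the threshold is fixed by hand rather than selected automatically, and the generalized Pareto parameters are fitted by the method of moments (which assumes $\xi < 1/2$) instead of whatever estimator the authors use. The VaR and CVaR expressions are the standard closed forms for a generalized Pareto tail, valid for $\xi < 1$.

```python
import math
import random
import statistics


def empirical_cvar(losses, alpha):
    """Sample-averaging CVaR: mean of the worst (1 - alpha) share of losses."""
    ordered = sorted(losses)
    k = max(1, round(len(ordered) * (1.0 - alpha)))  # number of tail points used
    return statistics.fmean(ordered[-k:])


def gpd_moment_fit(excesses):
    """Method-of-moments fit of a generalized Pareto distribution.

    Assumption for this sketch: finite variance (xi < 1/2). Uses
    mean = sigma / (1 - xi) and var = sigma^2 / ((1 - xi)^2 (1 - 2 xi)).
    """
    m = statistics.fmean(excesses)
    s2 = statistics.variance(excesses)
    r = m * m / s2  # equals 1 - 2 * xi under the GPD
    xi = 0.5 * (1.0 - r)
    sigma = 0.5 * m * (r + 1.0)
    return xi, sigma


def pot_var_cvar(losses, u, alpha):
    """Peaks-over-threshold VaR and CVaR at level alpha for threshold u."""
    n = len(losses)
    excesses = [x - u for x in losses if x > u]
    xi, sigma = gpd_moment_fit(excesses)
    # Ratio of the target tail probability to the empirical exceedance rate;
    # it should be below 1, i.e. alpha lies beyond the threshold level.
    ratio = n * (1.0 - alpha) / len(excesses)
    if abs(xi) < 1e-9:  # exponential-tail limit of the GPD
        var = u - sigma * math.log(ratio)
    else:
        var = u + sigma / xi * (ratio ** (-xi) - 1.0)
    cvar = (var + sigma - xi * u) / (1.0 - xi)
    return var, cvar


# Illustrative data only: heavy-tailed Pareto losses with tail index 1/4.
random.seed(0)
losses = [random.paretovariate(4.0) for _ in range(5000)]
u = sorted(losses)[int(0.95 * len(losses))]  # hand-picked 95% empirical quantile
print("sample-averaging CVaR:", empirical_cvar(losses, 0.999))
print("POT (VaR, CVaR):", pot_var_cvar(losses, u, 0.999))
```

With 5,000 losses and $\alpha = 0.999$, the sample-averaging estimate rests on about five observations, whereas the peaks-over-threshold estimate uses every exceedance above the threshold and extrapolates through the fitted tail.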

Figures (3)

Figure 1: Training performance for the POTPG algorithm and the sample averaging (SA) benchmark. Left column: RMSE of policy parameter estimate $\text{RMSE}_\theta$. Right column: RMSE of the objective function (the CVaR) $\text{RMSE}_{\widehat{J}}$. RMSE metrics are computed over $R=50$ runs.
Figure 2: Objective function (CVaR$_{0.999}$ of the hedging shortfall) versus the hedge ratio $\theta$, representing the percentage of the target option Gamma being neutralized. Estimates are obtained by brute force calculations, i.e. through sample averaging over $1,\!000,\!000$ simulated paths. Red point: optimal value.
Figure 3: Evolution of the RMSE of the estimate of the optimal policy parameter (RMSE$_\theta$) and the corresponding objective function (RMSE$_{\hat{J}}$) over iterations of the POTPG algorithm and the sample averaging (SA) benchmark. Top row: sample size $n=1,\!000$. Bottom row: $n=10,\!000$. Left panels: RMSE$_\theta$. Right panels: RMSE$_{\hat{J}}$.

Theorems & Definitions (5)

Definition 3.1
Definition 3.2
Theorem 3.1: Pickands–Balkema–de Haan
Corollary 3.1
Remark 6.1
