Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients
Parisa Davar, Frédéric Godin, Jose Garrido
TL;DR
This work tackles catastrophic tail risk in sequential decision making by optimizing CVaR$_{\alpha}$ with $\alpha$ near 1. It introduces POTPG, an EVT-based policy-gradient method that leverages peaks-over-threshold tail estimates to extrapolate the far tail and stabilize CVaR gradients, including automated threshold selection and a variance-reducing gradient procedure. The method is validated in controlled simulations and applied to dynamic option hedging under fat-tailed dynamics, where POTPG outperforms standard sample-averaging baselines, particularly when tail data are scarce. The results demonstrate that EVT-informed risk gradients can significantly improve tail-risk performance in reinforcement learning, with potential for extension to high-dimensional policies and deep RL.
Abstract
This paper addresses the mitigation of catastrophic risk (risk with very low frequency but very high severity) in sequential decision making. The problem is particularly challenging because observations are scarce in the far tail of the distribution of cumulative costs (negative rewards). We develop a policy gradient algorithm, called POTPG, based on approximations of the tail risk derived from extreme value theory. Numerical experiments show that our method outperforms common benchmarks that rely on the empirical distribution. An application to financial risk management, namely the dynamic hedging of a financial option, is presented.
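
As a rough illustration of the peaks-over-threshold idea behind extreme-value-based tail-risk approximations, the following Python sketch compares a plain sample-average CVaR estimate with one extrapolated from a generalized Pareto distribution (GPD) fitted to the excesses over a threshold. It is not the POTPG algorithm: there is no policy and no gradient, the threshold is a fixed empirical quantile chosen by hand, and the GPD is fitted by probability-weighted moments purely to keep the example free of dependencies. All function names, the simulated Pareto cost distribution, and the parameter values are assumptions made for illustration.

```python
import math
import random


def empirical_cvar(costs, alpha):
    """Sample-average CVaR: mean of the worst (1 - alpha) fraction of costs."""
    xs = sorted(costs)
    k = max(1, round((1 - alpha) * len(xs)))
    tail = xs[-k:]
    return sum(tail) / len(tail)


def fit_gpd_pwm(excesses):
    """Fit GPD shape xi and scale sigma by probability-weighted moments.

    Illustrative estimator choice (an assumption of this sketch); it is
    only well behaved for moderately heavy tails (roughly xi < 0.5).
    """
    xs = sorted(excesses)
    n = len(xs)
    a0 = sum(xs) / n
    # Plotting positions p_i = (i - 0.35) / n for i = 1, ..., n.
    a1 = sum((1 - (i + 1 - 0.35) / n) * x for i, x in enumerate(xs)) / n
    xi = 2 - a0 / (a0 - 2 * a1)
    sigma = 2 * a0 * a1 / (a0 - 2 * a1)
    return xi, sigma


def pot_cvar(costs, alpha, threshold_quantile=0.90):
    """Peaks-over-threshold estimates of (VaR_alpha, CVaR_alpha)."""
    xs = sorted(costs)
    n = len(xs)
    u = xs[int(threshold_quantile * n)]  # hand-picked threshold
    excesses = [x - u for x in xs if x > u]
    if len(excesses) < 10:
        raise ValueError("too few exceedances to fit a tail model")
    zeta = len(excesses) / n  # empirical exceedance probability
    ratio = (1 - alpha) / zeta
    if ratio >= 1:
        raise ValueError("alpha must lie beyond the threshold quantile")
    xi, sigma = fit_gpd_pwm(excesses)
    if xi >= 1:
        raise ValueError("fitted tail has infinite mean; CVaR undefined")
    if abs(xi) < 1e-8:  # exponential-tail limit of the GPD
        var = u - sigma * math.log(ratio)
        return var, var + sigma
    var = u + sigma / xi * (ratio ** (-xi) - 1)
    cvar = var / (1 - xi) + (sigma - xi * u) / (1 - xi)
    return var, cvar


def demo(n=2000, alpha=0.999, seed=1):
    """Compare both estimators on simulated fat-tailed costs."""
    rng = random.Random(seed)
    costs = [rng.paretovariate(3.0) for _ in range(n)]
    print("empirical CVaR:", empirical_cvar(costs, alpha))
    print("POT CVaR:      ", pot_cvar(costs, alpha)[1])


if __name__ == "__main__":
    demo()
```

With `n=2000` and `alpha=0.999`, the sample-average estimate rests on only two observations, whereas the threshold-based estimate uses roughly the top 10% of the sample. For the Pareto distribution simulated here (tail index 3, minimum 1), the exact value of CVaR$_{0.999}$ is 15, which gives a reference point when running `demo()` with different seeds.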
