Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning
Mehrdad Moghimi, Hyejin Ku
TL;DR
This work targets risk-sensitive sequential decision-making by replacing the fixed per-step risk measures common in distributional reinforcement learning (DRL) with static Spectral Risk Measures (SRMs). It introduces QR-SRM, an algorithm with convergence guarantees that optimizes SRMs within a distributional RL framework by leveraging an augmented MDP and a closed-form outer update for the risk function $h$. The SRM decomposition makes the intermediate risk preferences of the learned policy interpretable. Empirically, QR-SRM learns policies aligned with the SRM objective and often outperforms risk-neutral DRL and fixed-CVaR baselines across trading and control domains, while also addressing the ‘blindness to success’ issue common to CVaR-based methods. The approach offers a principled, interpretable, and flexible framework for risk-sensitive decision-making, with potential extensions to continuous action spaces and alternative distribution representations, and may improve safety and reliability in high-stakes applications.
Abstract
In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.
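To make the relationship between CVaR and the broader SRM class concrete, the following is a minimal sketch, not the paper's QR-SRM algorithm. It assumes the return distribution is represented by $N$ equally weighted quantile values (as in quantile-regression DRL), that CVaR is taken over the lower tail of returns, and that the spectrum is a finite mixture of CVaR levels. The function names `cvar_from_quantiles` and `srm_from_quantiles` and the sample numbers are hypothetical, chosen only for illustration.

```python
import numpy as np


def cvar_from_quantiles(theta, alpha):
    """Lower-tail CVaR at level alpha (0 < alpha <= 1).

    Assumption: theta holds N equally weighted quantile values, so the
    quantile function is piecewise constant on bins of width 1/N.
    """
    theta = np.sort(np.asarray(theta, dtype=float))
    n = theta.size
    edges = np.arange(n + 1) / n
    # Length of the overlap between each quantile bin and [0, alpha].
    overlap = np.clip(np.minimum(edges[1:], alpha) - edges[:-1], 0.0, None)
    # Average of the quantile function over [0, alpha].
    return float(overlap @ theta / alpha)


def srm_from_quantiles(theta, alphas, weights):
    """Spectral risk measure written as a convex mixture of CVaRs.

    Assumption: the spectrum is a step function, which is equivalent to
    sum_k weights[k] * CVaR at level alphas[k], with nonnegative weights
    that sum to one.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return float(
        sum(w * cvar_from_quantiles(theta, a) for a, w in zip(alphas, weights))
    )


# Hypothetical return quantiles for a single state-action pair.
theta = [-10.0, 0.0, 5.0, 15.0]
risk_neutral = cvar_from_quantiles(theta, 1.0)    # plain mean of the quantiles
tail_only = cvar_from_quantiles(theta, 0.25)      # worst 25% of outcomes only
blended = srm_from_quantiles(theta, [0.25, 1.0], [0.5, 0.5])
```

In this sketch, a single CVaR level looks only at the lower tail and ignores everything above it, whereas the mixture also places weight on the full distribution. This is one way to read the abstract's point that SRMs form a broader class than static CVaR.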
