Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning

Mehrdad Moghimi; Hyejin Ku

TL;DR

This work targets risk-sensitive sequential decision-making by replacing fixed stepwise risk in DRL with static Spectral Risk Measures (SRMs). It introduces QR-SRM, a convergent algorithm that optimizes SRMs within a distributional RL framework by leveraging an augmented MDP and a closed-form outer update for the risk function $h$, yielding interpretable intermediate risk preferences via the SRM decomposition. Empirically, QR-SRM learns policies aligned with the SRM objective and often outperforms risk-neutral DRL and fixed-CVaR baselines across trading and control domains, while also addressing the common ‘blindness to success’ issue of CVaR-based methods. The approach offers a principled, interpretable and flexible risk-sensitive framework with potential extensions to continuous action spaces and alternative distribution representations, enhancing safety and reliability in high-stakes applications.

Abstract

In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.
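To make the step from CVaR to spectral risk measures concrete, the following is a minimal sketch, not the paper's QR-SRM implementation. It assumes the standard definition of an SRM as a spectrum-weighted average of the quantile function of returns, and it assumes the return distribution is represented by N equally weighted quantiles, as in quantile-regression DRL. The function names (`quantile_weights`, `cvar_integral`, `mix`, `srm`) are hypothetical and chosen only for this illustration.

```python
import numpy as np

def quantile_weights(spectrum_integral, n):
    """Weight of each of n equal-probability quantile bins.

    spectrum_integral(u) is Phi(u) = integral of the risk spectrum phi on [0, u],
    so bin i receives Phi(i/n) - Phi((i-1)/n). (Hypothetical helper.)
    """
    edges = np.linspace(0.0, 1.0, n + 1)
    return np.diff(spectrum_integral(edges))

def cvar_integral(alpha):
    """Phi for CVaR at level alpha: phi(u) = 1/alpha on [0, alpha], 0 elsewhere."""
    return lambda u: np.minimum(u, alpha) / alpha

def mix(components):
    """Convex combination of spectra, given as (probability, Phi) pairs."""
    return lambda u: sum(p * phi_int(u) for p, phi_int in components)

def srm(quantiles, weights):
    """SRM estimate: weighted sum of the sorted quantiles of the return."""
    return float(np.dot(np.sort(quantiles), weights))

# Toy return distribution with ten equally likely outcomes (assumed data).
returns = np.arange(1.0, 11.0)
n = len(returns)

w_cvar = quantile_weights(cvar_integral(0.2), n)   # only the worst 20% counts
w_mean = quantile_weights(cvar_integral(1.0), n)   # CVaR at level 1 is the mean
w_mix = quantile_weights(mix([(0.5, cvar_integral(0.2)),
                              (0.5, cvar_integral(1.0))]), n)

print(srm(returns, w_cvar), srm(returns, w_mean), srm(returns, w_mix))
```

On this toy distribution the three values come out at approximately 1.5, 5.5, and 3.5. The pure CVaR weights are zero above the 20% level, so improvements in the upper quantiles cannot change the objective, which is one way to read the "blindness to success" issue. The mixed spectrum still emphasizes the worst outcomes but assigns nonzero weight to every quantile.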

Paper Structure (28 sections, 6 theorems, 68 equations, 8 figures, 6 tables, 2 algorithms)

Introduction
Related Works
Preliminary Studies
Spectral Risk Measures
Markov Decision Process
Distributional RL
Decomposition of Coherent Risk Measures
The Model
Intermediate Risk Preferences
Experimental Results
American Put Option Trading
Mean-reversion Trading Strategy
Windy Lunar Lander
Conclusion
Property of the Closed-form solution
...and 13 more sections

Key Result

Theorem 4.1

If $\pi_{k,l}$ denotes the greedy policy extracted from $G_{k,l}$ and $h_l$, then for all $x \in \mathcal{X}, s \in \mathcal{S}, c \in \mathcal{C}$, and $a \in \mathcal{A}$, […]. Additionally, $J(\pi^*_l, h_l)$ is bounded, increases monotonically in $l$, and provides a lower bound for our objective.

Figures (8)

Figure 1: A Markov process with transition probabilities denoted on the edges and rewards on the nodes. This process can also be viewed as an MDP under a deterministic policy $\pi$, in which case the number in each node denotes the reward $r(x, \pi(x))$.
Figure 2: The quantile function and the CDF of the return distributions in states $x_0$ (black), $x_1^1$ (green), and $x_1^2$ (blue) in Example 1.
Figure 3: Panel (a) illustrates the distribution of discounted returns for different policies. Panel (b) shows the exercise boundary of each policy.
Figure 4: Panel (a) illustrates the distribution of discounted returns for different policies. Panel (b) displays the risk spectra used to derive these policies.
Figure 5: The quantile function and the CDF of $G$ (black) and $G_t$ (blue) in Example 2.
...and 3 more figures

Theorems & Definitions (17)

Theorem 4.1
Theorem 5.1
Example 1
Lemma 2.1
proof
Lemma 2.2
proof
Lemma 2.3
proof
Theorem 2.4
...and 7 more
