Measures of Variability for Risk-averse Policy Gradient

Yudong Luo, Yangchen Pan, Jiaqi Tan, Pascal Poupart

TL;DR

This work systematically studies nine variability measures for risk-averse policy optimization in reinforcement learning, deriving gradient formulas and evaluating their practical performance within REINFORCE and PPO. It shows that variance-based metrics often yield unstable updates, while CVaR Deviation and Gini Deviation provide robust, high-quality risk-averse policies across diverse domains; Mean Deviation and Semi_Standard Deviation also offer strong, sample-efficient alternatives. The paper provides theoretical treatments, including analyses of gradient bias and proofs for several measures, and demonstrates how to implement mean-variability objectives with practical sampling and importance sampling (IS) techniques. Together, these findings guide practitioners in selecting effective variability metrics for risk-aware decision-making and lay a foundation for future theoretical and algorithmic advances in risk metrics for RL.

Abstract

Risk-averse reinforcement learning (RARL) is critical for decision-making under uncertainty, which is especially valuable in high-stakes applications. However, most existing works focus on risk measures, e.g., conditional value-at-risk (CVaR), while measures of variability remain underexplored. In this paper, we comprehensively study nine common measures of variability, namely Variance, Gini Deviation, Mean Deviation, Mean-Median Deviation, Standard Deviation, Inter-Quantile Range, CVaR Deviation, Semi_Variance, and Semi_Standard Deviation. Among them, four metrics have not been previously studied in RARL. We derive policy gradient formulas for these unstudied metrics, improve gradient estimation for Gini Deviation, analyze their gradient properties, and incorporate them into the REINFORCE and PPO frameworks to penalize the dispersion of returns. Our empirical study reveals that variance-based metrics lead to unstable policy updates. In contrast, CVaR Deviation and Gini Deviation show consistent performance across different types of randomness and evaluation domains, achieving high returns while effectively learning risk-averse policies. Mean Deviation and Semi_Standard Deviation are also competitive across different scenarios. This work provides a comprehensive overview of variability measures in RARL, offering practical insights for risk-aware decision-making and guiding future research on risk metrics and RARL algorithms.

Paper Structure

This paper contains 65 sections, 21 theorems, 98 equations, 11 figures, 5 tables, 2 algorithms.

Key Result

Lemma 1

$\Phi_h(X)$ has a quantile representation. If $F^{-1}_X$ is continuous, then $\Phi_h(X)=\int_0^1 F^{-1}_X(1-\alpha)\,\mathrm{d}h(\alpha)$, where $F^{-1}_X$ is the inverse CDF (quantile function) of $X$.
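
To make the quantile representation concrete, the following is a minimal sketch of a plug-in estimate of $\Phi_h(X)$ from samples. It replaces $F^{-1}_X$ with the empirical quantile function, so the integral becomes a weighted sum of order statistics. The function names and the two distortion functions are illustrative assumptions, not taken from this excerpt: $h(\alpha)=\alpha$ recovers the mean, and $h(\alpha)=\alpha-\alpha^2$ is the choice commonly associated with Gini Deviation, taken here as $\frac{1}{2}\mathbb{E}|X_1-X_2|$.

```python
import numpy as np

def distortion_functional(samples, h):
    """Plug-in estimate of Phi_h(X) = int_0^1 F_X^{-1}(1 - alpha) dh(alpha).

    Illustrative helper (not from the paper): uses the empirical quantile
    function in place of F_X^{-1}.
    """
    x = np.sort(np.asarray(samples, dtype=float))  # ascending order statistics
    n = x.size
    i = np.arange(1, n + 1)
    # The empirical quantile function equals x[i-1] for alpha in
    # [1 - i/n, 1 - (i-1)/n), so each order statistic is weighted by the
    # increment of h over that interval.
    weights = h(1.0 - (i - 1) / n) - h(1.0 - i / n)
    return float(np.dot(weights, x))

def h_mean(alpha):
    # Assumed distortion function: h(alpha) = alpha gives the mean.
    return alpha

def h_gini(alpha):
    # Assumed distortion function for Gini Deviation: h(alpha) = alpha - alpha^2.
    return alpha - alpha ** 2

rng = np.random.default_rng(0)
returns = rng.normal(size=500)  # stand-in for sampled episode returns

mean_est = distortion_functional(returns, h_mean)
gini_est = distortion_functional(returns, h_gini)

# Cross-check against the pairwise form 0.5 * E|X1 - X2| over all sample pairs.
pairwise = 0.5 * np.mean(np.abs(returns[:, None] - returns[None, :]))
```

Under these assumptions the weighted sum for `h_gini` coincides with the all-pairs average absolute difference, while requiring only a sort rather than a quadratic number of pair comparisons.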

Figures (11)

  • Figure 1: A modified Maze. Red state returns an uncertain reward (details in text).
  • Figure 2: The reward distribution of the red state is Gaussian. (a) Return and (b) Risk-averse (long path) rate of each algorithm vs. training episodes in Maze. Curves are averaged over 10 seeds with shaded regions indicating standard errors.
  • Figure 3: The reward distribution of the red state is Pareto. (a) Return and (b) Risk-averse (long path) rate of each algorithm vs. training episodes in Maze. Curves are averaged over 10 seeds with shaded regions indicating standard errors.
  • Figure 4: The reward distribution of the red state is Uniform. (a) Return and (b) Risk-averse (long path) rate of each algorithm vs. training episodes in Maze. Curves are averaged over 10 seeds with shaded regions indicating standard errors.
  • Figure 5: The reward distribution of the red state is a handcrafted distribution. (a) Return and (b) Risk-averse (long path) rate of each algorithm vs. training episodes in Maze. Curves are averaged over 10 seeds with shaded regions indicating standard errors.
  • ...and 6 more figures

Theorems & Definitions (24)

  • Definition 1: artzner1999coherent
  • Definition 2: furman2017gini
  • Definition 3: wang2020characterization
  • Lemma 1: wang2020characterization, Lemma 3
  • Lemma 2: wang2020characterization
  • Proposition 1
  • Proposition 2
  • Lemma 3: wang2020distortion
  • Proposition 3
  • Theorem 1
  • ...and 14 more