Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Minheng Xiao, Xian Yu, Lei Ying

TL;DR

This paper addresses risk-sensitive reinforcement learning within a distributional RL framework by seeking gradients of coherent risk measures over the full distribution of discounted costs. It develops a distributional policy gradient theory that yields an explicit gradient of the probability measure, and introduces CDPG, a finite-support, categorical approximation with provable convergence guarantees under inexact policy evaluation. The approach combines distributional policy evaluation with a categorical gradient framework to provide finite-time convergence of the policy updates and demonstrates improved sample efficiency in risk-sensitive settings on Cliffwalk and CartPole compared to non-distributional baselines. Overall, the work advances practical risk-aware DRL with rigorous gradient derivations and convergence guarantees, enabling safer and more reliable policies for high-stakes applications.

Abstract

Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods aim to learn a point estimate of the random cumulative cost, distributional RL (DRL) seeks to estimate its entire distribution, which leads to a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex, as it involves finding the gradient of a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, where we provide an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on a fixed set of points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of considering a risk-sensitive setting in DRL.
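To make the idea of a categorical family on fixed support points concrete, the following is a minimal sketch rather than the paper's CDPG procedure. It assumes the interpolation-style projection common in categorical distributional RL, where each cost sample splits its probability mass between its two neighbouring atoms. The function name `project_to_support` and the example numbers are hypothetical.

```python
from bisect import bisect_right


def project_to_support(samples, support):
    """Project equally weighted cost samples onto a fixed, sorted support.

    Assumes `support` is strictly increasing with at least two atoms.
    Each sample's mass is split between its two neighbouring atoms in
    proportion to proximity; samples outside the range are clipped.
    """
    probs = [0.0] * len(support)
    weight = 1.0 / len(samples)
    for z in samples:
        z = min(max(z, support[0]), support[-1])  # clip to the support range
        hi = min(bisect_right(support, z), len(support) - 1)
        lo = hi - 1
        gap = support[hi] - support[lo]
        frac_hi = (z - support[lo]) / gap  # share sent to the upper atom
        probs[lo] += weight * (1.0 - frac_hi)
        probs[hi] += weight * frac_hi
    return probs


support = [0.0, 1.0, 2.0]
samples = [0.5, 2.0, 1.25, 0.0]
probs = project_to_support(samples, support)
mean_categorical = sum(p * z for p, z in zip(probs, support))
mean_samples = sum(samples) / len(samples)
```

For samples that lie inside the support range, this projection preserves the mean, so `mean_categorical` and `mean_samples` agree up to floating-point error. Risk measures that depend on the tail are only approximated, and the quality of that approximation depends on how finely the atoms are spaced.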

Paper Structure

This paper contains 34 sections, 25 theorems, 118 equations, 4 figures, 2 tables, and 2 algorithms.

Key Result

Theorem 2.1

A risk measure $\rho$ is coherent iff there exists a convex, bounded, and closed set ${\mathcal{U}} \subset {\mathcal{B}}$, called the risk envelope, such that for any random variable $Z \in {\mathcal{Z}}$, $\rho(Z) = \max_{\xi \in {\mathcal{U}}} \mathbb{E}_{\xi}[Z]$, where ${\mathcal{B}}=\{\xi: \int_{\Omega}\xi(\omega)f_{Z}(\omega)d\omega=1,\ \xi\succeq 0\}$ and $\mathbb{E}_{\xi}[Z] = \int_{\Omega} \xi(\omega) f_{Z}(\omega)Z(\omega)d\omega$ is the $\xi$-weighted expectation of $Z$.
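As an illustration of the risk-envelope representation in Theorem 2.1, the sketch below evaluates conditional value-at-risk (CVaR) for a discrete cost distribution. It relies on the standard fact that the CVaR envelope at tail level $\alpha$ consists of reweightings $\xi$ with $0 \le \xi \le 1/\alpha$ and $\sum_{\omega} \xi(\omega) f_{Z}(\omega) = 1$. The function name `cvar_via_envelope`, the choice of CVaR, and the example numbers are assumptions made for illustration and are not taken from the paper.

```python
def cvar_via_envelope(costs, probs, alpha):
    """Maximise E_xi[Z] over the CVaR envelope {0 <= xi <= 1/alpha, E[xi] = 1}.

    For a discrete Z the maximum is attained greedily: give the largest
    costs the maximal density 1/alpha until the reweighted mass reaches 1.
    """
    remaining = 1.0  # reweighted probability mass still to assign
    value = 0.0
    for z, p in sorted(zip(costs, probs), reverse=True):
        mass = min(p / alpha, remaining)  # xi * p for this outcome
        value += mass * z
        remaining -= mass
        if remaining <= 0.0:
            break
    return value


costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
risk_neutral = cvar_via_envelope(costs, probs, alpha=1.0)  # plain expectation
risk_averse = cvar_via_envelope(costs, probs, alpha=0.2)   # mean of worst 20%
```

With $\alpha = 1$ the envelope contains only $\xi \equiv 1$, so the value reduces to the ordinary expectation. Smaller $\alpha$ lets $\xi$ concentrate on high-cost outcomes, so the value can only increase. This is the sense in which the envelope encodes the degree of risk aversion.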

Figures (4)

  • Figure 1: Comparison between risk-averse and risk-neutral policies. Figure (a) illustrates the environment settings. Figure (b) displays the cost distribution. Figure (c) shows the average test cost and Figure (d) shows the average test cost under a warm-start and early-stopping regime, which speeds up training.
  • Figure 2: Comparison between the CDPG and SPG (Tamar et al., 2015) algorithms in the Cliffwalk environment. Figure (a) shows the divergence from the safe path using different fixed sample sizes after 100 iterations. Figures (b), (c), and (d) depict the average test cost with respect to the iteration count, the number of trajectories sampled, and the computational time, respectively, where CDPG is accelerated using a warm-start and early-stopping regime.
  • Figure 3: Comparison between the CDPG and SPG (Tamar et al., 2015) algorithms in the CartPole environment with a continuous state space. Figure (a) shows an example CartPole state where the best action is to move to the right. Figure (b) presents the cost estimates for the two possible actions. Figures (c) and (d) illustrate the cumulative score with respect to the iteration count and the number of sampled trajectories, respectively.
  • Figure :

Theorems & Definitions (56)

  • Theorem 2.1 (Artzner et al., 1999; Shapiro et al., 2009)
  • Theorem 2.2 (Tamar et al., 2015)
  • Definition 2.5: Pushforward Measure
  • Definition 2.6: Distributional Bellman Operator (Rowland et al., 2018)
  • Proposition 2.7 (Bellemare et al., 2017)
  • Lemma 2.8: Distributional Bellman Equation (Rowland et al., 2018)
  • Theorem 3.1: Distributional Policy Gradient Theorem
  • Remark 3.2
  • Corollary 3.3
  • Definition 4.1
  • ...and 46 more