Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Minheng Xiao; Xian Yu; Lei Ying

TL;DR

This paper addresses risk-sensitive reinforcement learning within a distributional RL framework by seeking gradients of coherent risk measures over the full distribution of discounted costs. It develops a distributional policy gradient theory that yields an explicit gradient of the probability measure, and introduces CDPG, a finite-support, categorical approximation with provable convergence guarantees under inexact policy evaluation. The approach combines distributional policy evaluation with a categorical gradient framework to provide finite-time convergence of the policy updates and demonstrates improved sample efficiency in risk-sensitive settings on Cliffwalk and CartPole compared to non-distributional baselines. Overall, the work advances practical risk-aware DRL with rigorous gradient derivations and convergence guarantees, enabling safer and more reliable policies for high-stakes applications.

Abstract

Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods aim to learn a point estimate of the random cumulative cost, distributional RL (DRL) seeks to estimate its entire distribution, which leads to a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex, as it involves finding the gradient of a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, where we provide an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on some fixed points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of considering a risk-sensitive setting in DRL.

Paper Structure

This paper contains 34 sections, 25 theorems, 118 equations, 4 figures, 2 tables, and 2 algorithms.

Introduction
Prior Work.
Main Contributions of Our Paper and Comparisons with Prior Work.
Preliminaries
Markov Decision Process (MDP).
Policy Gradient Methods.
Coherent Risk Measures.
Distributional Reinforcement Learning (DRL).
Distributional Policy Gradient
Categorical Distributional Policy Gradient with Provable Convergence
Categorical Approximation
CDPG Algorithm
Finite-Time Convergence Analysis under Inexact Policy Evaluation
Numerical Experiments
Cliffwalk
...and 19 more sections

Key Result

Theorem 2.1

A risk measure $\rho$ is coherent iff there exists a convex, bounded, and closed set ${\mathcal{U}} \subset {\mathcal{B}}$, called the risk envelope, such that for any random variable $Z \in {\mathcal{Z}}$, $\rho(Z) = \max_{\xi \in {\mathcal{U}}} \mathbb{E}_{\xi}[Z]$, where ${\mathcal{B}}=\{\xi: \int_{\Omega}\xi(\omega)f_{Z}(\omega)d\omega=1,\ \xi\succeq 0\}$ and $\mathbb{E}_{\xi}[Z] = \int_{\Omega} \xi(\omega) f_{Z}(\omega)Z(\omega)d\omega$ is the $\xi$-weighted expectation of $Z$.
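To make the risk-envelope idea concrete, here is a minimal sketch for one well-known coherent risk measure, conditional value-at-risk (CVaR), on a finite-support cost distribution. It assumes the standard dual form in which the risk is the largest $\xi$-weighted expectation over the envelope, and it assumes the usual CVaR envelope $\{\xi : 0 \le \xi_i \le 1/\alpha,\ \sum_i p_i \xi_i = 1\}$, where `alpha` denotes the tail mass. The function name `cvar_via_envelope` and the example numbers are illustrative and are not taken from the paper.

```python
import numpy as np

def cvar_via_envelope(costs, probs, alpha):
    """CVaR of a discrete cost distribution via its risk envelope.

    Assumption: alpha in (0, 1] is the tail mass, so CVaR is the mean of
    the worst alpha-fraction of costs. The envelope is
    {xi : 0 <= xi_i <= 1/alpha, sum_i p_i * xi_i = 1}.
    """
    costs = np.asarray(costs, dtype=float)
    probs = np.asarray(probs, dtype=float)
    xi = np.zeros_like(probs)
    remaining = 1.0  # reweighted probability mass still to be assigned

    # Greedy maximizer: push as much weight as allowed onto the highest costs.
    for i in np.argsort(-costs):
        if remaining <= 0.0:
            break
        if probs[i] == 0.0:
            continue
        mass = min(probs[i] / alpha, remaining)  # cap xi_i at 1/alpha
        xi[i] = mass / probs[i]
        remaining -= mass

    # xi-weighted expectation: sum_i xi_i * p_i * z_i
    return float(np.sum(xi * probs * costs)), xi

if __name__ == "__main__":
    value, weights = cvar_via_envelope([0.0, 1.0, 10.0], [0.5, 0.4, 0.1], alpha=0.2)
    print(value, weights)
```

With costs (0, 1, 10), probabilities (0.5, 0.4, 0.1), and a 20% tail, the worst 20% of outcomes consists of 10% mass at cost 10 and 10% mass at cost 1, so the value is 5.5. Setting `alpha=1.0` makes every weight equal to 1 and recovers the ordinary (risk-neutral) expectation.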

Figures (4)

Figure 1: Comparison between risk-averse and risk-neutral policies. Figure (a) illustrates the environment settings. Figure (b) displays the cost distribution. Figure (c) shows the average test cost and Figure (d) shows the average test cost under a warm-start and early-stopping regime, which speeds up training.
Figure 2: Comparison between the CDPG and SPG [tamar2015policy] algorithms under Cliffwalking settings. Figure (a) shows the divergence from the safe path using different fixed sample sizes after 100 iterations. Figures (b), (c), and (d) depict the average test cost with respect to the iteration count, the number of trajectories sampled, and the computational time, respectively, where CDPG is accelerated using a warm-start and early-stopping regime.
Figure 3: Comparison between the CDPG and SPG [tamar2015policy] algorithms in the CartPole environment with a continuous state space. Figure (a) shows an example CartPole state where the best action is to move to the right. Figure (b) presents the cost estimates for the two possible actions. Figures (c) and (d) illustrate the cumulative score with respect to the iteration count and the number of sampled trajectories, respectively.

Theorems & Definitions (56)

Theorem 2.1: [artzner1999coherent, shapiro2009lectures]
Theorem 2.2: [tamar2015policy]
Definition 2.5: Pushforward Measure
Definition 2.6: Distributional Bellman Operator [rowland2018analysis]
Proposition 2.7: [bellemare2017distributional]
Lemma 2.8: Distributional Bellman Equation [rowland2018analysis]
Theorem 3.1: Distributional Policy Gradient Theorem
Remark 3.2
Corollary 3.3
Definition 4.1
...and 46 more
