On the Global Convergence of Risk-Averse Natural Policy Gradient Methods with Expected Conditional Risk Measures

Xian Yu; Lei Ying

TL;DR

This work studies the global convergence of risk-averse reinforcement learning, using dynamic time-consistent Expected Conditional Risk Measures (ECRMs) to define the risk-sensitive objective. It develops entropy-regularized, softmax-parameterized Natural Policy Gradient (NPG) methods that treat the first-step update separately from subsequent-step updates. Under exact policy evaluation it proves global optimality and linear convergence, and it extends these guarantees to inexact policy evaluation with a controlled error budget. The analysis yields dimension-free iteration complexity with explicit rates in terms of problem constants and the learning rate $\beta$, and highlights the role of $\tau$ in balancing exploration and convergence. Empirical results on a stochastic Cliffwalk environment corroborate the theory, showing improved stability and faster attainment of low-cost policies compared to risk-neutral baselines.

Abstract

Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While it has been shown that policy gradient methods can find globally optimal policies in the risk-neutral setting, it remains unclear if the risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexity of the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.
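To make the phrase "NPG with softmax parameterization and entropy regularization" concrete, the sketch below runs the standard risk-neutral entropy-regularized NPG update on a small random tabular MDP with exact policy evaluation. It is not the paper's risk-averse algorithm: the ECRM objective, the separate first-step update, and the $\eta$-space are all omitted. The cost convention, the names `soft_q`, `npg_step`, `run_npg`, and the choice of step size `eta = (1 - gamma) / tau` (with `tau` taken to be the entropy weight) are assumptions made for illustration.

```python
import numpy as np

def soft_q(P, c, pi, gamma, tau):
    """Exact entropy-regularized cost-to-go of policy pi (cost minimization).

    P: (S, A, S) transition kernel, c: (S, A) costs, pi: (S, A) policy.
    """
    S = c.shape[0]
    # Expected one-step regularized cost: sum_a pi(a|s) [c(s,a) + tau * log pi(a|s)]
    reg_cost = np.sum(pi * (c + tau * np.log(pi)), axis=1)
    # State-to-state kernel under pi: P_pi[s, t] = sum_a pi(a|s) P[s, a, t]
    P_pi = np.einsum("sat,sa->st", P, pi)
    # Solve the linear Bellman equation V = reg_cost + gamma * P_pi V
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, reg_cost)
    Q = c + gamma * (P @ V)
    return Q, V

def npg_step(pi, Q, eta, gamma, tau):
    """Multiplicative form of the entropy-regularized NPG update under softmax."""
    logits = (1.0 - eta * tau / (1.0 - gamma)) * np.log(pi) - eta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def run_npg(P, c, gamma=0.8, tau=0.5, eta=None, iters=100):
    S, A = c.shape
    eta = (1.0 - gamma) / tau if eta is None else eta  # assumed step size
    pi = np.full((S, A), 1.0 / A)  # uniform initial policy
    values = []
    for _ in range(iters):
        Q, V = soft_q(P, c, pi, gamma, tau)
        values.append(V.mean())
        pi = npg_step(pi, Q, eta, gamma, tau)
    return pi, np.array(values)

# Hypothetical random MDP, used only to exercise the update.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((S, A))
pi, values = run_npg(P, c)
```

With this step size the exponent on the old policy vanishes and each iteration reduces to soft policy iteration, so the recorded average soft value should be non-increasing across iterations.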

Paper Structure (14 sections, 12 theorems, 91 equations, 2 figures, 1 algorithm)

Introduction
Preliminaries
Notation.
Policy Gradient Methods
Coherent One-Step Conditional Risk Measures
Expected Conditional Risk Measures
Global Convergence of Risk-Averse Natural Policy Gradient Algorithms
Risk-Averse NPG Algorithms with Exact Policy Evaluation
Approximate Risk-Averse NPG Algorithms with Inexact Policy Evaluation
Numerical Results
Conclusions
Omitted Proofs in Section \ref{sec:problem}
Omitted Proofs in Section \ref{sec:natural}
Detailed Algorithm

Key Result

Proposition 1

Denote the ECRM objective function in the original $\eta$-space and in the discretized $\mathcal{H}$-space by $\mathbb{F}(c_{[1,\infty]}|s_1)$ and $\mathbb{F}^I(c_{[1,\infty]}|s_1)$, respectively. Then for any given $\epsilon_{opt}> 0$, we have $\left|\mathbb{F}^I(c_{[1,\infty]}|s_1) - \mathbb{F}(c_{[1,\infty]}|s_1)\right| \le \epsilon_{opt}$ whenever $I\ge\left(1+\frac{1}{\alpha}\right)\frac{\lambda\gamma}{1-\gamma}\frac{1}{\epsilon_{opt}}$.
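As a quick arithmetic check of the threshold on $I$ in Proposition 1, the sketch below evaluates the right-hand side and rounds up to the nearest integer. The function name `min_discretization_size` and the sample parameter values are hypothetical, and the input checks only enforce what the formula needs to be well defined.

```python
import math

def min_discretization_size(alpha, lam, gamma, eps_opt):
    """Smallest integer I with I >= (1 + 1/alpha) * lam * gamma / (1 - gamma) / eps_opt."""
    if alpha <= 0 or lam < 0 or not (0 <= gamma < 1) or eps_opt <= 0:
        raise ValueError("need alpha > 0, lam >= 0, 0 <= gamma < 1, eps_opt > 0")
    bound = (1.0 + 1.0 / alpha) * lam * gamma / (1.0 - gamma) / eps_opt
    return math.ceil(bound)

# Illustrative values only: (1 + 2) * 0.25 / 0.5 / 0.25 = 6
I_min = min_discretization_size(alpha=0.5, lam=0.5, gamma=0.5, eps_opt=0.25)
```

The bound scales as $1/\epsilon_{opt}$, so halving the tolerance doubles the required $I$, and it grows without bound as $\gamma \to 1$ or $\alpha \to 0$.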

Figures (2)

Figure 1: Illustration of CVaR.
Figure 2: Risk-averse NPG vs. PG algorithm with varying $\tau$.
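Since Figure 1 illustrates CVaR, a small numerical sketch may help. It computes the empirical CVaR of a cost sample in two ways: as the average of the worst $\alpha$-fraction of outcomes, and through the Rockafellar-Uryasev variational form $\min_{\eta}\{\eta + \frac{1}{\alpha}\mathbb{E}[(X-\eta)_+]\}$. The convention that $\alpha$ is the tail probability and that larger costs are worse is an assumption here, as are the function names; the source page does not spell out its convention.

```python
import numpy as np

def cvar_tail_average(x, alpha):
    """Mean of the worst alpha-fraction of costs (assumes alpha * len(x) is an integer)."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(round(alpha * x.size))
    if k < 1:
        raise ValueError("alpha * len(x) must be at least 1")
    return x[-k:].mean()

def cvar_variational(x, alpha):
    """min over eta of eta + E[(X - eta)_+] / alpha on the empirical distribution.

    The objective is convex and piecewise linear in eta with kinks at the
    sample points, so it is enough to search over the samples themselves.
    """
    x = np.asarray(x, dtype=float)
    # Row i uses eta = x[i]; columns range over the samples.
    excess = np.maximum(x[None, :] - x[:, None], 0.0)
    objective = x + excess.mean(axis=1) / alpha
    return objective.min()

costs = np.arange(1, 11)  # hypothetical cost sample: 1, 2, ..., 10
tail = cvar_tail_average(costs, alpha=0.2)
variational = cvar_variational(costs, alpha=0.2)
```

On this sample both routines agree, and with `alpha=1` the variational form reduces to the plain expectation, which is the risk-neutral case.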

Theorems & Definitions (15)

Proposition 1: $\epsilon_{opt}$-optimal Discretization
Theorem 1
Theorem 2: Risk-Averse Policy Gradients with Entropy Regularizer
Lemma 1
Lemma 2
Theorem 3: Performance Improvement
Theorem 4: Linear Convergence of Exact Risk-Averse NPG
Remark 1: Linear convergence of soft value functions
Remark 2: Iteration complexity for achieving an $\epsilon$-optimal policy of the original MDP
Lemma 3
...and 5 more
