On the Global Convergence of Risk-Averse Natural Policy Gradient Methods with Expected Conditional Risk Measures

Xian Yu, Lei Ying

TL;DR

This work studies the global convergence of risk-averse reinforcement learning, using dynamic time-consistent Expected Conditional Risk Measures (ECRMs) to define the risk-sensitive objective. It develops entropy-regularized, softmax-parameterized Natural Policy Gradient (NPG) methods that separate first-step and subsequent-step updates. It proves global optimality and linear convergence under exact policy evaluation, and extends the guarantees to inexact policy evaluation with a controlled error budget. The analysis yields dimension-free iteration complexity and explicit rates in terms of problem constants and the learning rate $\beta$, and highlights the role of the entropy-regularization weight $\tau$ in balancing exploration and convergence. Empirical results on a stochastic Cliffwalk environment corroborate the theoretical findings, showing improved stability and faster attainment of low-cost policies than risk-neutral baselines.

Abstract

Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While it has been shown that policy gradient methods can find globally optimal policies in the risk-neutral setting, it remains unclear if the risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRMs-based RL problems. We provide global optimality and iteration complexity of the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.
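
To make the ingredients named in the abstract concrete, the following is a minimal sketch of the standard risk-neutral entropy-regularized NPG update under softmax parameterization, written for costs. It is not the paper's ECRM-based algorithm, which separates first-step and subsequent-step updates. The random MDP, the function names `soft_q`, `npg_step`, and `run_npg`, and the choice of step size are all assumptions made for illustration.

```python
import numpy as np

def soft_q(P, c, pi, gamma, tau):
    """Exact soft Q-values (for costs) of policy pi.
    P: (S, A, S) transitions, c: (S, A) costs, pi: (S, A) policy."""
    S = c.shape[0]
    # Expected one-step cost plus the tau * log(pi) regularization term.
    c_pi = (pi * (c + tau * np.log(pi))).sum(axis=1)
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve the linear Bellman equation V = c_pi + gamma * P_pi @ V.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return c + gamma * (P @ V)

def npg_step(pi, Q, beta, gamma, tau):
    """One multiplicative NPG update, computed in log space for stability."""
    logits = (1.0 - beta * tau / (1.0 - gamma)) * np.log(pi) - beta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def run_npg(P, c, gamma, tau, beta, iters=200):
    """Run NPG from the uniform policy with exact policy evaluation."""
    S, A = c.shape
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        pi = npg_step(pi, soft_q(P, c, pi, gamma, tau), beta, gamma, tau)
    return pi

# Hypothetical toy MDP: 4 states, 3 actions, random transitions and costs.
rng = np.random.default_rng(0)
S, A, gamma, tau = 4, 3, 0.5, 0.5
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((S, A))
pi = run_npg(P, c, gamma, tau, beta=0.5 * (1 - gamma) / tau)
```

At a fixed point of this update the policy satisfies $\pi(a|s) \propto \exp(-Q(s,a)/\tau)$, so a larger $\tau$ keeps the policy closer to uniform, while a smaller $\tau$ makes it greedier with respect to the soft Q-values.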

Paper Structure

This paper contains 14 sections, 12 theorems, 91 equations, 2 figures, 1 algorithm.

Key Result

Proposition 1

Denote the ECRM objective function under the original $\eta$-space and the discretized $\mathcal{H}$-space as $\mathbb{F}(c_{[1,\infty]}|s_1)$ and $\mathbb{F}^I(c_{[1,\infty]}|s_1)$, respectively. Then for any given $\epsilon_{opt}> 0$, we have $|\mathbb{F}(c_{[1,\infty]}|s_1) - \mathbb{F}^I(c_{[1,\infty]}|s_1)| \le \epsilon_{opt}$ whenever $I\ge(1+\frac{1}{\alpha})\frac{ \lambda\gamma}{1-\gamma}\frac{1}{\epsilon_{opt}}$.
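
As a quick arithmetic check, the lower bound on $I$ in Proposition 1 can be evaluated directly. The sketch below assumes $\alpha$ is the CVaR tail level in $(0,1]$, $\lambda$ is a nonnegative risk weight, and $\gamma$ is the discount factor; these readings are not defined in this section, and the helper name `min_grid_size` is hypothetical.

```python
import math

def min_grid_size(alpha: float, lam: float, gamma: float, eps_opt: float) -> int:
    """Smallest integer I satisfying
    I >= (1 + 1/alpha) * lam * gamma / (1 - gamma) / eps_opt."""
    if not (0.0 < alpha <= 1.0):
        raise ValueError("alpha is assumed to lie in (0, 1]")
    if not (0.0 <= gamma < 1.0):
        raise ValueError("gamma is assumed to lie in [0, 1)")
    if lam < 0.0 or eps_opt <= 0.0:
        raise ValueError("lam must be nonnegative and eps_opt positive")
    bound = (1.0 + 1.0 / alpha) * lam * gamma / (1.0 - gamma) / eps_opt
    # Use at least one grid point even when the bound is zero.
    return max(1, math.ceil(bound))

# Hypothetical parameter values, chosen only to show the scaling.
print(min_grid_size(alpha=0.25, lam=1.0, gamma=0.5, eps_opt=0.125))
```

The bound grows linearly in $1/\epsilon_{opt}$ and in $1/\alpha$, and blows up as $\gamma \to 1$, so a tighter tolerance, a smaller tail level, or a longer effective horizon each call for a finer discretization.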

Figures (2)

  • Figure 1: Illustration of CVaR.
  • Figure 2: Risk-averse NPG vs. PG algorithm with varying $\tau$.
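
Since Figure 1 illustrates CVaR, a small numerical sketch may help. It uses the Rockafellar-Uryasev variational form $\mathrm{CVaR}_\alpha(X) = \min_\eta \{\eta + \frac{1}{\alpha}\mathbb{E}[(X-\eta)_+]\}$ on an empirical sample of costs. Treating $\alpha$ as the tail probability is an assumption about the convention, and the function name `cvar_ru` and the sample values are hypothetical.

```python
import numpy as np

def cvar_ru(costs, alpha, eta_grid=None):
    """Empirical CVaR via min over eta of eta + E[(X - eta)_+] / alpha.

    If eta_grid is None, eta ranges over the sample points, which is enough
    for an empirical distribution because the objective is piecewise linear
    and convex with breakpoints at the samples. A coarser user-supplied grid
    gives an upper bound on the exact value.
    """
    costs = np.asarray(costs, dtype=float)
    etas = costs if eta_grid is None else np.asarray(eta_grid, dtype=float)
    # excess[i, j] = max(costs[j] - etas[i], 0)
    excess = np.maximum(costs[None, :] - etas[:, None], 0.0)
    values = etas + excess.mean(axis=1) / alpha
    return float(values.min())

sample = [1.0, 2.0, 3.0, 4.0]
exact = cvar_ru(sample, alpha=0.5)                        # mean of the worst half
coarse = cvar_ru(sample, alpha=0.5, eta_grid=[0.0, 5.0])  # coarse eta grid
```

With $\alpha = 1$ the minimization recovers the plain expectation, while smaller $\alpha$ concentrates on the high-cost tail. Restricting $\eta$ to a finite grid can only increase the minimized value, which is why the grid needs to be fine enough for the gap to stay small.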

Theorems & Definitions (15)

  • Proposition 1: $\epsilon_{opt}$-optimal Discretization
  • Theorem 1
  • Theorem 2: Risk-Averse Policy Gradients with Entropy Regularizer
  • Lemma 1
  • Lemma 2
  • Theorem 3: Performance Improvement
  • Theorem 4: Linear Convergence of Exact Risk-Averse NPG
  • Remark 1: Linear convergence of soft value functions
  • Remark 2: Iteration complexity for achieving an $\epsilon$-optimal policy of the original MDP
  • Lemma 3
  • ...and 5 more