On the Global Convergence of Risk-Averse Natural Policy Gradient Methods with Expected Conditional Risk Measures
Xian Yu, Lei Ying
TL;DR
This work studies the global convergence of risk-averse reinforcement learning, using dynamic time-consistent Expected Conditional Risk Measures (ECRMs) to define a risk-sensitive objective. It develops entropy-regularized, softmax-parameterized Natural Policy Gradient (NPG) methods that separate first-step and subsequent-step updates. It proves global optimality and linear convergence under exact policy evaluation, and extends these guarantees to inexact policy evaluation with a controlled error budget. The analysis yields dimension-free iteration complexity and gives explicit rates in terms of problem constants and the learning rate $\beta$, highlighting the role of $\tau$ in balancing exploration and convergence. Empirical results on a stochastic Cliffwalk environment corroborate the theoretical findings: compared with risk-neutral baselines, the method is more stable and reaches low-cost policies faster, which supports the practical relevance of ECRM-based NPG for risk-sensitive decision-making.
Abstract
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While policy gradient methods have been shown to find globally optimal policies in the risk-neutral setting, it remains unclear whether their risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRM-based RL problems. We establish global optimality and iteration complexity guarantees for the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization, under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.
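For orientation, the following is a minimal sketch of an entropy-regularized NPG update under softmax parameterization with exact policy evaluation, written for a plain risk-neutral, discounted, tabular cost MDP. It is not the ECRM-based algorithm of the paper: it does not include the risk measure or the separate first-step update. The function names (`soft_q`, `npg_step`, `run_npg`), the array layout, and the reading of `tau` as the entropy-regularization weight and `beta` as the step size are assumptions made for illustration.

```python
import numpy as np

def soft_q(P, c, pi, gamma, tau):
    """Exact evaluation of the entropy-regularized cost-to-go of policy pi.

    P: transitions, shape (S, A, S); c: costs, shape (S, A); pi: shape (S, A).
    """
    S, _ = c.shape
    # Expected one-step regularized cost in each state under pi.
    c_pi = np.sum(pi * (c + tau * np.log(pi)), axis=1)
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return c + gamma * (P @ V)  # soft Q-values, shape (S, A)

def npg_step(pi, Q, beta, gamma, tau):
    """One NPG step in policy space (costs are minimized, hence the minus sign)."""
    logits = (1.0 - beta * tau / (1.0 - gamma)) * np.log(pi) - beta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def run_npg(P, c, gamma=0.9, tau=0.5, beta=None, iters=200):
    """Iterate NPG from the uniform policy; returns the policy and its soft Q."""
    S, A = c.shape
    if beta is None:
        beta = (1.0 - gamma) / tau  # largest step size in this sketch
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        pi = npg_step(pi, soft_q(P, c, pi, gamma, tau), beta, gamma, tau)
    return pi, soft_q(P, c, pi, gamma, tau)

# Usage on a small random MDP (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
P_demo = rng.random((3, 2, 3))
P_demo /= P_demo.sum(axis=2, keepdims=True)
c_demo = rng.random((3, 2))
pi_demo, Q_demo = run_npg(P_demo, c_demo)
```

With the default step size `beta = (1 - gamma) / tau`, the update reduces to soft policy iteration, and the returned policy should be close to the fixed point where each row of the policy is proportional to `exp(-Q / tau)`. Smaller values of `beta` take more conservative steps toward the same fixed point.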
