Policy Gradient Methods for Distortion Risk Measures

Nithia Vijayan, Prashanth L. A

TL;DR

The paper develops policy gradient algorithms for risk-sensitive reinforcement learning that maximize a distortion risk measure (DRM) $\rho_g(\theta)$ of the cumulative reward in episodic MDPs, where $g$ is the distortion function and $\theta$ is the policy parameter. It derives a DRM-specific policy gradient theorem and pairs it with likelihood-ratio gradient estimators for both on-policy and off-policy settings, using order statistics and importance sampling to construct tractable gradient estimates. The authors prove non-asymptotic convergence to an $\epsilon$-stationary point and validate the approach with grid-world simulations, which show that certain distortion functions (e.g., logarithmic) can yield safer, more variance-aware policies. The work thus provides finite-sample guarantees for DRM-based RL and a way to build explicit risk preferences into the learned policy.

Abstract

We propose policy gradient algorithms which learn risk-sensitive policies in a reinforcement learning (RL) framework. Our proposed algorithms maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process in on-policy and off-policy RL settings, respectively. We derive a variant of the policy gradient theorem that caters to the DRM objective, and integrate it with a likelihood ratio-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.
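
To make the DRM objective concrete, the sketch below estimates a DRM from sampled episode returns using order statistics. It assumes the standard definition $\rho_g(X) = \int_{-\infty}^{0} \big(g(1 - F_X(x)) - 1\big)\,dx + \int_{0}^{\infty} g(1 - F_X(x))\,dx$, with $F_X$ replaced by the empirical CDF of the samples. The names `empirical_drm` and `g` are illustrative and not taken from the paper, and the paper's gradient estimator is not reproduced here.

```python
import numpy as np

def empirical_drm(samples, g):
    """Plug-in DRM estimate from i.i.d. samples of the cumulative reward.

    Assumes g is a distortion function on [0, 1] with g(0) = 0 and g(1) = 1.
    Substituting the empirical CDF into the DRM integral gives a weighted
    sum of the order statistics X_(1) <= ... <= X_(m).
    """
    x = np.sort(np.asarray(samples, dtype=float))  # order statistics
    m = x.size
    i = np.arange(1, m + 1)
    # Weight on the i-th smallest sample: g(1 - (i-1)/m) - g(1 - i/m).
    weights = g(1.0 - (i - 1) / m) - g(1.0 - i / m)
    return float(np.dot(weights, x))

returns = [3.0, 1.0, 0.0, 2.0]                      # hypothetical episode returns
mean_value = empirical_drm(returns, lambda u: u)    # identity g recovers the sample mean
averse_value = empirical_drm(returns, lambda u: u ** 2)  # convex g stresses low returns
```

With the identity distortion the estimate is the sample mean (1.5 here). The convex choice $g(u) = u^2$, used only as an illustration, puts more weight on the smallest returns and gives 0.875.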

Paper Structure

This paper contains 14 sections, 81 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: Examples of distortion functions
  • Figure 2: Modified Frozen Lake