A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization

Yudong Luo; Yangchen Pan; Han Wang; Philip Torr; Pascal Poupart

TL;DR

The paper tackles sample inefficiency in CVaR-based RL by introducing a simple mixture policy parameterization that blends a risk-neutral policy $\pi^n$ with an adjustable policy $\pi'$ through a state-dependent weight $w(s)$: $\pi(a|s) = w(s)\pi'(a|s) + (1-w(s))\pi^n(a|s)$. This design allows all collected trajectories to be used for updates and mitigates gradient vanishing by driving higher returns through the risk-neutral part, which lifts the left tail of the return distribution. Empirical results on Maze, LunarLander, and MuJoCo show that the mixture policy (MIX) can learn risk-averse policies in scenarios where CVaR-PG struggles, and can outperform off-policy baselines that rely on environment dynamics control. The approach is simple and broadly applicable to CVaR optimization in RL, and could be combined with other sample-efficiency techniques.
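To make the mixture formula concrete, here is a minimal sketch for a single state with a discrete action set. The function name `mixture_policy`, the list-based representation, and the requirement that the weight lies in [0, 1] are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def mixture_policy(w, pi_adjustable, pi_neutral):
    """Combine two action distributions for one state.

    Computes pi(a|s) = w(s) * pi'(a|s) + (1 - w(s)) * pi^n(a|s).
    `w` is assumed to lie in [0, 1] so the result stays a distribution.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(pi_adjustable) != len(pi_neutral):
        raise ValueError("both policies must cover the same actions")
    return [w * p + (1.0 - w) * q for p, q in zip(pi_adjustable, pi_neutral)]


# Hypothetical three-action example: a small w keeps the mixture
# close to the risk-neutral policy.
pi_adj = [0.7, 0.2, 0.1]   # adjustable policy pi'
pi_neu = [0.1, 0.1, 0.8]   # risk-neutral policy pi^n
mixed = mixture_policy(0.25, pi_adj, pi_neu)
print(mixed)  # approximately [0.25, 0.125, 0.625]
```

Because the result is a convex combination of two distributions, it sums to one whenever both inputs do.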

Abstract

Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some Mujoco environments where the traditional CVaR-PG fails to learn a reasonable policy.
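The two sources of inefficiency named above can be illustrated with a small sketch. It assumes the empirical estimator in which CVaR is the mean of the worst `ceil(alpha * N)` returns, and a per-trajectory weight of the form `min(R - VaR, 0)`, which is the shape commonly used in CVaR policy gradient estimators; the exact estimator in the paper may differ. The names `lower_tail` and `tail_weights` are hypothetical.

```python
import math


def lower_tail(returns, alpha):
    """Return (VaR estimate, CVaR estimate, tail size) for the worst alpha-fraction."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    k = max(1, math.ceil(alpha * len(returns)))  # number of tail trajectories
    tail = sorted(returns)[:k]                   # the k lowest returns
    var = tail[-1]                               # empirical alpha-quantile
    cvar = sum(tail) / k                         # mean of the lower tail
    return var, cvar, k


def tail_weights(returns, alpha):
    """Assumed per-trajectory weight: zero unless the return is below VaR."""
    var, _, _ = lower_tail(returns, alpha)
    return [min(r - var, 0.0) for r in returns]


# Spread-out returns: only trajectories in the tail get a nonzero weight.
spread = [10.0, -5.0, 3.0, 8.0, -20.0, 6.0, 7.0, 9.0, 4.0, 5.0]
print(lower_tail(spread, 0.2), tail_weights(spread, 0.2))

# Flat lower tail: every weight is zero, so the gradient estimate vanishes.
flat = [0.0, 0.0, 0.0, 5.0, 9.0]
print(tail_weights(flat, 0.4))
```

In the first case most sampled trajectories contribute nothing to the update; in the second, none do, which is the flat-tail situation the mixture with a risk-neutral policy is meant to avoid.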

Paper Structure (36 sections, 16 equations, 12 figures, 1 algorithm)

Introduction
Background: CVaR Optimization in RL
Problem Formulation
CVaR Policy Gradient (CVaR-PG)
Distributional RL with CVaR
Other CVaR RL Algorithms
Mixture Parameterization Policies
Challenges of CVaR-PG: low-efficiency gradient estimation
Mixture with Risk-neutral Policy
A Motivating Maze Example
Offline RL Risk Neutral Learning
Experiments
Tabular case: Maze Problem
Discrete control: LunarLander
Continuous control: Mujoco
...and 21 more sections

Figures (12)

Figure 1: (a) A maze domain with a green goal state. The red state returns an uncertain reward (details in the Maze Problem section). Triangle pointers indicate the risk-neutral actions (not unique for the second state). (b) Value of $w$ in the mixture policy equation for each state after the mixture policy is updated by CVaR-PG. (c) The empirical quantile function of the total return in Maze at an early training stage, when the initial policy is a random policy or a mixture policy.
Figure 2: (a) Policy return (y-axis) and (b) risk-aversion (long path) rate (y-axis) vs. training episodes in Maze. Curves are averaged over 10 seeds with shaded regions indicating standard errors.
Figure 3: (a, c) Policy return (y-axis) and (b, d) left-landing rate (i.e., risk-averse landing rate) (y-axis) vs. training episodes or steps in LunarLander. Curves are averaged over 10 seeds with shaded regions indicating standard errors. For the left-landing rate, higher is better.
Figure 4: (a, c) Policy return (y-axis) in InvertedPendulum, (b, d) visiting non-noisy region rate (y-axis) in InvertedPendulum, (e, g) final X-position (y-axis) in HalfCheetah, (f, h) final X-position (y-axis) in Ant vs. training episodes or steps in MuJoCo. Curves are averaged over 10 seeds with shaded regions indicating standard errors. For the location visiting rate, higher is better.
Figure 5: Policy gradient norm (y-axis) of CVaR-PG in Maze. Curves are averaged over 10 seeds with shaded regions indicating standard errors.
...and 7 more figures
