A Reductions Approach to Risk-Sensitive Reinforcement Learning with Optimized Certainty Equivalents

Kaiwen Wang; Dawen Liang; Nathan Kallus; Wen Sun

TL;DR

This work develops a unified reductions framework for risk-sensitive reinforcement learning under optimized certainty equivalents (OCE). By augmenting the MDP with a budget state (AugMDP), it reduces OCE-RL to risk-neutral RL and enables two meta-algorithms: an optimism-based approach using various RL oracles and a policy-gradient-based method with natural policy gradient updates. The paper provides finite-sample, non-asymptotic guarantees for both approaches, including the first risk-sensitive bounds for exogenous block MDPs, and demonstrates that history-dependent policies are necessary for optimal OCE performance in a proof-of-concept MDP. Empirical results show that the proposed methods outperform the best Markovian policies across several OCE risk measures and policy classes, highlighting the practical impact for safety-critical and risk-aware applications.

Abstract

We study risk-sensitive RL where the goal is to learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We consider a family of risks called the optimized certainty equivalents (OCE), which captures important risk measures such as conditional value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounded in optimism and another based on policy gradients, both of which can leverage the broad suite of risk-neutral RL algorithms in an augmented Markov Decision Process (MDP). Via a reductions approach, we leverage theory for risk-neutral RL to establish novel OCE bounds in complex, rich-observation MDPs. For the optimism-based algorithm, we prove bounds that generalize prior results in CVaR RL and that provide the first risk-sensitive bounds for exogenous block MDPs. For the gradient-based algorithm, we establish both monotone improvement and global convergence guarantees under a discrete reward assumption. Finally, we empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP, where all Markovian policies provably fail.
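
For readers unfamiliar with the OCE family, the standard definition from the risk-measure literature can be written as below. This is a sketch under assumptions: the abstract does not state the formula, the symbols $u$, $b$, $\tau$ and $\beta$ are the usual textbook ones, and the paper's own notation (for example the range over which $b$ is optimized) may differ.

```latex
% u: concave, non-decreasing utility (assumed normalization: u(0) = 0)
\mathrm{OCE}_u(X) \;=\; \sup_{b \in \mathbb{R}} \big\{\, b + \mathbb{E}[\,u(X - b)\,] \,\big\}

% CVaR at level \tau \in (0, 1], obtained with u(t) = \tau^{-1} \min(t, 0)
\mathrm{CVaR}_\tau(X) \;=\; \sup_{b \in \mathbb{R}} \big\{\, b - \tau^{-1}\, \mathbb{E}[(b - X)_+] \,\big\}

% Entropic risk with parameter \beta > 0, obtained with u(t) = \beta^{-1}(1 - e^{-\beta t})
\mathrm{Ent}_\beta(X) \;=\; -\beta^{-1} \log \mathbb{E}[\,e^{-\beta X}\,]
```

In each case the inner variable $b$ plays the role of a budget, which is the quantity the augmented MDP tracks as part of the state.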

Paper Structure

This paper contains 27 sections, 19 theorems, 61 equations, 2 figures, 5 tables, 3 algorithms.

Introduction
Related Works
Preliminaries
Augmented MDP for OCE
Meta-Algorithm with Optimism
Bounds for exogenous block MDPs
Proof for Main Reduction (\ref{thm:optimism-regret})
Meta-Algorithm with Policy Optimization
Case Study: Natural Policy Gradient
Simulation Experiments
Setting up synthetic MDP.
Experiment with tabular policies.
Experiment with neural network policies.
Conclusion
List of Notations
...and 12 more sections

Key Result

Theorem 2.1

There exists an initial budget $b^\star_1\in[0,1]$ s.t. the optimal risk-neutral $\pi_{\textnormal{aug}}^\star$ in the AugMDP with initial budget $b^\star_1$ achieves optimal OCE in the original MDP.
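
The role of the budget in this statement can be illustrated numerically for a single random return. The following is a minimal sketch, not the paper's algorithm: it assumes the CVaR instance of OCE with utility $u(t) = \tau^{-1}\min(t, 0)$, a return supported on $[0,1]$, and a grid search over budgets. The function names `cvar_utility`, `oce_on_grid` and `cvar_direct` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def cvar_utility(t, tau):
    # Assumed OCE utility for CVaR at level tau: u(t) = min(t, 0) / tau.
    return np.minimum(t, 0.0) / tau

def oce_on_grid(values, probs, utility, budgets):
    # Evaluate b + E[u(X - b)] for every budget b on the grid and
    # return the best objective value together with the maximizing budget.
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    objective = np.array([b + probs @ utility(values - b) for b in budgets])
    best = int(np.argmax(objective))
    return float(objective[best]), float(budgets[best])

def cvar_direct(values, probs, tau):
    # Reference computation: average of the worst tau-fraction of outcomes.
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)
    remaining, total = tau, 0.0
    for v, p in zip(values[order], probs[order]):
        take = min(p, remaining)   # probability mass taken from this outcome
        total += take * v
        remaining -= take
        if remaining <= 0.0:
            break
    return total / tau

# Toy return distribution on [0, 1] (illustrative numbers, not from the paper).
values = [0.0, 0.5, 1.0]
probs = [0.1, 0.4, 0.5]
tau = 0.25
budgets = np.linspace(0.0, 1.0, 101)

oce_value, best_budget = oce_on_grid(
    values, probs, lambda t: cvar_utility(t, tau), budgets
)
print(oce_value, best_budget, cvar_direct(values, probs, tau))
```

On this toy distribution the grid maximizer lies inside $[0,1]$ and the OCE objective at that budget matches the directly computed CVaR. In the sequential setting the same outer search is carried out over the initial budget $b_1$, with the inner expectation replaced by the value of a risk-neutral policy in the AugMDP.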

Figures (2)

Figure 1: A simple MDP where the optimal CVaR policy is history-dependent. Each policy's cumulative reward distribution is shown below.
Figure 2: Learning curves for \ref{alg:policy-optimization} with three oracles: REINFORCE and PPO with forward and backward KL. We repeat runs five times and report $95\%$ confidence intervals for the mean performance.

Theorems & Definitions (35)

Theorem 2.1
proof: Proof of \ref{thm:informal-optimality}
Definition 3.1: Optimistic oracle
Theorem 3.2
Definition 3.3
Theorem 3.6
proof: Proof of \ref{thm:optimism-regret}
Definition 4.1: PO Oracle
Theorem 4.2: Global Convergence
Lemma 4.2: RLB
...and 25 more
