Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk

Xinyi Ni, Lifeng Lai

TL;DR

The paper studies the robustness of CVaR-based risk-sensitive reinforcement learning in Robust MDPs (RMDPs), using the dual representation of CVaR and considering both predetermined, fixed-budget ambiguity sets (Radon-Nikodym and KL) and state-action-dependent ones. It shows that robust CVaR optimization can be recast as risk-sensitive RL with an adjusted confidence level, e.g., $\alpha' = \alpha/K$ for Radon-Nikodym and $\alpha' = \alpha/\kappa^{1/\alpha}$ for KL, and introduces EVaR as an alternative under KL divergence. To handle decision-dependent uncertainty, it introduces a new risk measure, NCVaR, proves its coherence, provides an NCVaR decomposition, and develops a convergent NCVaR value-iteration approach (with interpolation) to compute optimal policies. Grid-world experiments show that the proposed methods yield more risk-averse, robust policies, and the paper points to extending robustness to broader uncertainty sets in risk-sensitive RL as an avenue for further work.
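To make the confidence-level adjustment concrete, here is a minimal Python sketch that evaluates CVaR of a discrete cost distribution at the nominal level $\alpha$ and at the adjusted level $\alpha' = \alpha/K$ quoted above for the Radon-Nikodym case. The function names `cvar` and `robust_cvar_rn`, the toy distribution, and the requirement $K \ge 1$ are illustrative assumptions, not the paper's code.

```python
from typing import Sequence


def cvar(costs: Sequence[float], probs: Sequence[float], alpha: float) -> float:
    """CVaR at level alpha of a discrete cost: mean of the worst alpha-fraction."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    remaining = alpha  # probability mass of the tail still to be collected
    total = 0.0
    # Walk through outcomes from the highest cost downwards.
    for cost, p in sorted(zip(costs, probs), reverse=True):
        mass = min(p, remaining)
        total += mass * cost
        remaining -= mass
        if remaining <= 1e-12:
            break
    return total / alpha


def robust_cvar_rn(costs: Sequence[float], probs: Sequence[float],
                   alpha: float, K: float) -> float:
    """Hypothetical helper: robust CVaR under a Radon-Nikodym budget K,
    evaluated as nominal CVaR at the adjusted level alpha / K (assumes K >= 1)."""
    if K < 1.0:
        raise ValueError("this sketch assumes a budget K >= 1")
    return cvar(costs, probs, alpha / K)


# Toy cost distribution (illustrative numbers only).
costs = [0.0, 1.0, 5.0, 20.0]
probs = [0.5, 0.3, 0.15, 0.05]

nominal = cvar(costs, probs, alpha=0.2)
robust = robust_cvar_rn(costs, probs, alpha=0.2, K=2.0)
print(nominal, robust)
```

Because a smaller confidence level averages over a thinner, worse tail, the adjusted value is never below the nominal one, which matches the more risk-averse behaviour reported for the robust policies.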

Abstract

Robust Markov Decision Processes (RMDPs) have received significant research interest, offering an alternative to standard Markov Decision Processes (MDPs), which often assume fixed transition probabilities. RMDPs relax this assumption by optimizing for the worst-case scenario within an ambiguity set. While earlier studies on RMDPs have largely centered on risk-neutral reinforcement learning (RL), with the goal of minimizing expected total discounted costs, in this paper we analyze the robustness of CVaR-based risk-sensitive RL under RMDPs. First, we consider predetermined ambiguity sets. Based on the coherency of CVaR, we establish a connection between robustness and risk sensitivity, so that techniques from risk-sensitive RL can be adopted to solve the proposed problem. Furthermore, motivated by the existence of decision-dependent uncertainty in real-world problems, we study problems with state-action-dependent ambiguity sets. To solve them, we define a new risk measure named NCVaR and establish the equivalence of NCVaR optimization and robust CVaR optimization. We further propose value iteration algorithms and validate our approach in simulation experiments.
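
As a sketch of how coherence links robustness to risk sensitivity, the standard dual representation of CVaR for a cost $Z$ under the nominal model $P$ can be written as follows. The ambiguity set $\mathcal{U}_K$ (models $Q$ whose density ratio $\eta = Q/P$ is at most $K \ge 1$) and the symbols $\eta$ and $\zeta$ are notation assumed here for illustration and need not match the paper's.

$$
\text{CVaR}^{P}_{\alpha}(Z) = \sup_{\xi}\left\{ \mathbb{E}_P[\xi Z] \;:\; 0 \le \xi \le \tfrac{1}{\alpha},\ \mathbb{E}_P[\xi] = 1 \right\}
$$

$$
\sup_{Q \in \mathcal{U}_K} \text{CVaR}^{Q}_{\alpha}(Z)
= \sup_{\eta,\,\xi}\left\{ \mathbb{E}_P[\eta\,\xi\,Z] \;:\; 0 \le \eta \le K,\ \mathbb{E}_P[\eta] = 1,\ 0 \le \xi \le \tfrac{1}{\alpha},\ \mathbb{E}_P[\eta\,\xi] = 1 \right\}
\le \sup_{\zeta}\left\{ \mathbb{E}_P[\zeta Z] \;:\; 0 \le \zeta \le \tfrac{K}{\alpha},\ \mathbb{E}_P[\zeta] = 1 \right\}
= \text{CVaR}^{P}_{\alpha/K}(Z)
$$

The product $\zeta = \eta\,\xi$ is itself a density with respect to $P$ bounded by $K/\alpha$, which is exactly the dual feasible set of CVaR at level $\alpha/K$. The adjusted level $\alpha' = \alpha/K$ stated in the TL;DR corresponds to this bound being attained.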

Paper Structure

This paper contains 10 sections, 3 theorems, 24 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

(NCVaR Decomposition) For any $\alpha\in(0,1]$ and $\vec{\kappa}$ satisfying the budget assumption, $\text{NCVaR}_{\alpha,\vec{\kappa}}$ has the following decomposition where $\xi(x_{t+1})=\frac{Q(x'|x,a)}{P(x'|x,a)}\geq 0$ is in the set

Figures (1)

  • Figure 1: Optimal value function and path in robust CVaR optimization across various uncertainty sets.

Theorems & Definitions (4)

  • Definition 1
  • Theorem 1
  • Lemma 1
  • Theorem 2