Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Zilong Deng, Simon Khan, Shaofeng Zou

TL;DR

The paper tackles the sample complexity of Iterated CVaR RL with a generative model, formulating the problem via CVaR's dual to connect with distributionally robust RL. It develops the ICVaR-VI algorithm and proves near-optimal upper and lower bounds, with a general bound of $\tilde{O}\left(\frac{SA}{\tau^2(1-\gamma)^4\epsilon^2}\right)$ and an improved bound when $\tau \geq \gamma$, complemented by a matching minimax lower bound $\tilde{O}\left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$; a Worst-Path RL regime yields a bound of $\tilde{O}\left(\frac{SA}{p_{\min}}\right)$. These results establish minimax optimality for fixed risk levels and reveal distinct behavior in the small-$\tau$ regime. The approach leverages a robust Bellman operator and per-$(s,a)$ uncertainty sets, enabling a model-based, value-iteration framework with tight dependence on the horizon, state/action counts, and risk tolerance. The findings have practical implications for risk-sensitive decision-making in settings where samples are costly and safety is critical, and they open avenues for extending the framework to other coherent risk measures.
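
As a back-of-the-envelope check on why the bounds are described as matching for a fixed risk level, one can divide the general upper bound by the lower bound, ignoring constants and logarithmic factors. This uses only the expressions quoted above and is not a derivation taken from the paper:

$$
\frac{\dfrac{SA}{\tau^2(1-\gamma)^4\epsilon^2}}{\dfrac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}} = \frac{1}{\tau(1-\gamma\tau)} \leq \frac{1}{\tau(1-\tau)} \quad \text{for } \tau < 1,
$$

where the inequality holds because $\gamma \leq 1$ implies $1-\gamma\tau \geq 1-\tau$. The ratio is free of $S$, $A$, and $\epsilon$, and for a fixed $\tau < 1$ it is bounded by a constant that does not depend on $\gamma$. It grows as $\tau \to 0$, which is consistent with the small-$\tau$ regime behaving differently. At $\tau = 1$ the condition $\tau \geq \gamma$ holds, and the improved bound $\frac{SA}{(1-\gamma)^3\epsilon^2}$ coincides with the lower bound evaluated at $\tau = 1$.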

Abstract

In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, a criterion we refer to as Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributionally robust RL with a specific uncertainty set for CVaR. We establish nearly matching upper and lower bounds on the sample complexity of this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O} \left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2} \right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, when $\tau \geq \gamma$, the sample complexity improves to $\tilde{O} \left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show a minimax lower bound of $\tilde{O} \left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2} \right)$. For a fixed risk level $\tau \in (0,1]$, our upper and lower bounds match, demonstrating the tightness and optimality of our analysis. We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O} \left(\frac{SA}{p_{\min}} \right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
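
The setup described in the abstract can be illustrated with a minimal Python sketch: draw a fixed number of next-state samples per state-action pair from a generative model, build an empirical transition kernel, and run value iteration in which the usual expectation over next states is replaced by the CVaR at level $\tau$ (the mean of the worst $\tau$-fraction of outcomes, computed here through its dual form with density ratio capped at $1/\tau$). This is a sketch under stated assumptions, not the paper's ICVaR-VI pseudocode: the function names `cvar_lower_tail`, `estimate_model`, and `icvar_value_iteration` are hypothetical, rewards are assumed known, and a fixed iteration count stands in for a proper stopping rule.

```python
import numpy as np


def cvar_lower_tail(values, probs, tau):
    """CVaR at level tau of a discrete distribution: mean of the worst tau-fraction.

    Uses the dual form: put as much mass as allowed (probs[i] / tau) on the
    lowest-valued outcomes until the reweighted mass sums to one.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    remaining = 1.0
    total = 0.0
    for i in np.argsort(values):  # visit outcomes from worst to best
        weight = min(probs[i] / tau, remaining)
        total += weight * values[i]
        remaining -= weight
        if remaining <= 1e-12:
            break
    return total


def estimate_model(P_true, n, rng):
    """Empirical kernel from n generative-model samples per (s, a) pair."""
    S, A, _ = P_true.shape
    P_hat = np.zeros_like(P_true, dtype=float)
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n, p=P_true[s, a])
            P_hat[s, a] = np.bincount(next_states, minlength=S) / n
    return P_hat


def icvar_value_iteration(P_hat, R, gamma, tau, iters=500):
    """Value iteration with a per-step CVaR in place of the expectation."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * cvar_lower_tail(V, P_hat[s, a], tau)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

A call such as `icvar_value_iteration(estimate_model(P_true, n, np.random.default_rng(0)), R, gamma, tau)` returns a value estimate and a greedy policy for the empirical model. With `tau = 1` the CVaR reduces to the ordinary expectation, so the routine falls back to standard value iteration.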

Paper Structure

This paper contains 20 sections, 11 theorems, 114 equations, 1 table, 1 algorithm.

Key Result

Theorem 1

For any risk level $\tau \in (0,1]$, the number of samples needed by Algorithm ICVaR-VI to return an $\epsilon$-optimal policy with probability at least $1 - \delta$ is at most $\tilde{\mathcal{O}}\biggl(\frac{SA}{\tau^2(1-\gamma)^4\epsilon^2} \biggr).$ In addition, when $\tau \geq \gamma$, the samp

Theorems & Definitions (16)

  • Theorem 1: Sample Complexity Upper Bound
  • Remark 1
  • Proof sketch of Theorem 1
  • Theorem 2: Sample Complexity Lower Bound
  • Remark 2
  • Theorem 3: Worst-Path RL Upper Bound
  • Remark 3
  • Theorem 4: Worst-Path RL Lower Bound
  • Lemma 1
  • Lemma 2 (Robust_generative_model, Lemma 5)
  • ...and 6 more