Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model
Zilong Deng, Simon Khan, Shaofeng Zou
TL;DR
The paper studies the sample complexity of Iterated CVaR RL with a generative model, using the dual representation of CVaR to connect the problem to distributionally robust RL. It develops the ICVaR-VI algorithm and proves nearly matching upper and lower bounds: a general upper bound of $\tilde{O}\left(\frac{SA}{\tau^2(1-\gamma)^4\epsilon^2}\right)$, an improved upper bound of $\tilde{O}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)$ when $\tau \geq \gamma$, and a minimax lower bound of $\tilde{O}\left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. In the Worst-Path RL regime, the matching upper and lower bounds are $\tilde{O}\left(\frac{SA}{p_{\min}}\right)$. These results establish minimax optimality for fixed risk levels and reveal distinct behavior in the small-$\tau$ regime. The approach relies on a robust Bellman operator and per-$(s,a)$ uncertainty sets, which give a model-based value-iteration framework whose bounds depend explicitly on the discount factor, the state and action counts, and the risk tolerance. The findings are relevant to risk-sensitive decision-making in settings where samples are costly and safety is critical, and they open avenues for extending the framework to other coherent risk measures.
Abstract
In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, a criterion we refer to as Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributionally robust RL with a specific uncertainty set for CVaR. We establish nearly matching upper and lower bounds on the sample complexity of this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O} \left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2} \right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, when $\tau \geq \gamma$, the sample complexity improves to $\tilde{O} \left( \frac{SA}{(1-\gamma)^3\epsilon^2} \right)$. We further show a minimax lower bound of $\tilde{O} \left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2} \right)$. For a fixed risk level $\tau \in (0,1]$, our upper and lower bounds match, demonstrating the tightness and optimality of our analysis. We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O} \left(\frac{SA}{p_{\min}} \right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
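To make the setting concrete, the following is a minimal Python sketch of model-based value iteration under an Iterated CVaR criterion. It is an illustration under stated assumptions, not the paper's ICVaR-VI algorithm: the abstract does not specify the algorithm's details, so the sketch assumes the standard dual form of CVaR for a discrete distribution, $\mathrm{CVaR}_\tau(V) = \min_{q} \sum_{s'} q(s') V(s')$ subject to $0 \le q(s') \le p(s')/\tau$ and $\sum_{s'} q(s') = 1$, and an empirical transition model built from a fixed number of generative-model samples per $(s,a)$ pair. The function names (`cvar_discrete`, `icvar_value_iteration`, `empirical_model`) and all parameters are hypothetical.

```python
import numpy as np


def cvar_discrete(values, probs, tau):
    """Lower-tail CVaR at level tau of a discrete distribution.

    Solves the dual problem greedily: put as much reweighted mass
    (at most probs[i] / tau) as possible on the smallest values.
    """
    order = np.argsort(values)
    remaining = 1.0  # mass of the worst-case distribution q still to assign
    total = 0.0
    for i in order:
        weight = min(probs[i] / tau, remaining)
        total += weight * values[i]
        remaining -= weight
        if remaining <= 1e-12:
            break
    return total


def icvar_value_iteration(P, R, gamma, tau, iters=500):
    """Value iteration with a CVaR backup at every step (assumed form).

    P has shape (S, A, S), R has shape (S, A).
    Returns the value vector and a greedy policy.
    """
    S, A = R.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * cvar_discrete(V, P[s, a], tau)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)


def empirical_model(P_true, n, rng):
    """Draw n next-state samples per (s, a) from a generative model."""
    S, A, _ = P_true.shape
    P_hat = np.zeros_like(P_true, dtype=float)
    for s in range(S):
        for a in range(A):
            P_hat[s, a] = rng.multinomial(n, P_true[s, a]) / n
    return P_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 2-state, 2-action MDP used only for illustration.
    P_true = np.array([[[0.9, 0.1], [0.5, 0.5]],
                       [[0.2, 0.8], [0.6, 0.4]]])
    R = np.array([[1.0, 0.5], [0.0, 0.3]])
    P_hat = empirical_model(P_true, n=1000, rng=rng)
    V, policy = icvar_value_iteration(P_hat, R, gamma=0.9, tau=0.3)
    print(V, policy)
```

With $\tau = 1$ the CVaR backup reduces to the ordinary expectation, so the sketch recovers standard value iteration. Smaller values of $\tau$ concentrate the backup on the worst next states, which is why the value estimates can only decrease as $\tau$ shrinks.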
