Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model
Oliver Mortensen, Mohammad Sadegh Talebi
TL;DR
This work studies the sample complexity of learning the optimal $Q$-function and an optimal policy in finite discounted MDPs under recursive entropic risk with parameter $\beta\neq0$, assuming access to a generative model. It introduces model-based risk-sensitive $Q$-value iteration (MB-RS-QVI) and derives $(\varepsilon,\delta)$-PAC bounds for both the $Q^*$-approximation error and the performance of the resulting greedy policy, revealing an exponential dependence on the effective horizon $\frac{1}{1-\gamma}$ that grows with $|\beta|$. The paper further proves lower bounds showing that this exponential dependence is unavoidable and that the upper bounds are tight in $S,A,\delta,\varepsilon$, highlighting a fundamental hardness gap relative to risk-neutral RL. These results establish the first concrete sample-complexity bounds for discounted, entropic-risk RL and motivate future exploration of model-free approaches and function approximation in risk-sensitive settings. Overall, the findings quantify the practical difficulty of risk-sensitive decision-making in long-horizon tasks and provide principled guidance for algorithm design under entropic risk.
Abstract
In this paper, we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter $\beta\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive $Q$-value iteration (MB-RS-QVI), that leads to $(\varepsilon,\delta)$-PAC bounds on $\|Q^*-Q_k\|$ and $\|V^*-V^{\pi_k}\|$, where $Q_k$ is the output of MB-RS-QVI after $k$ iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$, and the strength of this dependence grows with the learner's risk sensitivity $|\beta|$. We also provide two lower bounds which show that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC bounds are tight in the parameters $S,A,\delta,\varepsilon$ and that, unlike in the classical setting, it is not possible to have polynomial dependence in all model parameters.
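To make the model-based approach concrete, the following is a minimal Python sketch of what a procedure in the spirit of MB-RS-QVI could look like. It is not the authors' implementation. The abstract does not state the update rule, so the sketch assumes a commonly used recursive entropic backup, $Q(s,a) \leftarrow r(s,a) + \gamma \frac{1}{\beta}\log \mathbb{E}_{s'\sim \hat P(\cdot\mid s,a)}\left[e^{\beta \max_{a'} Q(s',a')}\right]$, with a known reward table $r$ and an empirical kernel $\hat P$ built from $n$ generative-model samples per state-action pair. The names `entropic_risk`, `estimate_model`, `rs_qvi`, and the `sample_next_state` callback are hypothetical.

```python
import numpy as np

def entropic_risk(values, probs, beta):
    """(1/beta) * log E[exp(beta * X)] for a finite distribution (beta != 0).

    Restricted to the support and shifted by the max exponent for stability.
    """
    mask = probs > 0
    z = beta * values[mask]
    m = z.max()
    return (m + np.log(np.dot(probs[mask], np.exp(z - m)))) / beta

def estimate_model(sample_next_state, S, A, n, rng):
    """Empirical kernel from n generative-model calls per (s, a).

    sample_next_state(s, a, rng) is an assumed callback returning a state index.
    """
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return P_hat / n

def rs_qvi(P_hat, r, gamma, beta, k):
    """Run k assumed entropic backups on the empirical model, starting from Q = 0.

    Returns the final Q table and the greedy policy with respect to it.
    """
    S, A, _ = P_hat.shape
    Q = np.zeros((S, A))
    for _ in range(k):
        V = Q.max(axis=1)  # greedy state values under the current Q
        Q = np.array([[r[s, a] + gamma * entropic_risk(V, P_hat[s, a], beta)
                       for a in range(A)] for s in range(S)])
    return Q, Q.argmax(axis=1)
```

Under this assumed backup, $\beta<0$ corresponds to risk-averse and $\beta>0$ to risk-seeking evaluation of the next-state value, and the update reduces to the risk-neutral one when transitions are deterministic. The exponential inside the expectation is one intuitive source of the $|\beta|\frac{1}{1-\gamma}$ dependence discussed in the abstract, since values can be as large as $\frac{1}{1-\gamma}$ when rewards lie in $[0,1]$.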
