Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Oliver Mortensen, Mohammad Sadegh Talebi

TL;DR

This work studies the sample complexity of learning the optimal $Q$-function and policy in finite discounted MDPs under recursive entropic risk with parameter $\beta\neq0$, assuming access to a generative model. It introduces the model-based risk-sensitive Q-value iteration (MB-RS-QVI) and derives $(\varepsilon,\delta)$-PAC bounds for both $Q^*$-approximation and the resulting policy performance, revealing an exponential dependence on the effective horizon $\frac{1}{1-\gamma}$ that grows with $|\beta|$. The paper further proves matching lower bounds showing that this exponential dependence is unavoidable, highlighting a fundamental hardness gap relative to risk-neutral RL. These results establish the first concrete sample-complexity bounds for discounted, entropic-risk RL and motivate future exploration of model-free approaches and function approximation in risk-sensitive settings. Overall, the findings quantify the practical difficulty of risk-sensitive decision-making in long-horizon tasks and provide principled guidance for algorithm design under entropic risk.

Abstract

In this paper, we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $\beta\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI), which leads to $(\varepsilon,\delta)$-PAC-bounds on $\|Q^*-Q_k\|$ and $\|V^*-V^{\pi_k}\|$, where $Q_k$ is the output of MB-RS-QVI after $k$ iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$, and the strength of this dependence grows with the learner's risk-sensitivity $|\beta|$. We also provide two lower bounds which show that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are tight in the parameters $S,A,\delta,\varepsilon$ and that, unlike in the classical setting, it is not possible to have polynomial dependence in all model parameters.
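To make the model-based approach concrete, the following is a minimal sketch of what an MB-RS-QVI-style procedure could look like, not the authors' implementation. It assumes a Bellman backup of the form $r(s,a)+\frac{\gamma}{\beta}\log\mathbb{E}_{s'}[e^{\beta\max_{a'}Q(s',a')}]$, known deterministic rewards, and a hypothetical generative-model callback `sample_next_state(s, a, n, rng)` that returns `n` sampled next-state indices. The names `entropic_backup` and `mb_rs_qvi` are illustrative only.

```python
import numpy as np


def entropic_backup(P, r, Q, beta, gamma):
    """One application of the assumed entropic Bellman optimality backup.

    P: (S, A, S) transition probabilities, r: (S, A) rewards, Q: (S, A).
    """
    V = Q.max(axis=1)          # greedy value of each next state, shape (S,)
    z = beta * V
    m = z.max()                # shift for a numerically stable log-sum-exp
    # (1 / beta) * log E_{s' ~ P(.|s,a)}[exp(beta * V(s'))], shape (S, A)
    risk = (m + np.log(P @ np.exp(z - m))) / beta
    return r + gamma * risk


def mb_rs_qvi(sample_next_state, r, S, A, beta, gamma, n, k, rng):
    """Estimate the model from n samples per (s, a), then run k backups on it."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            nxt = sample_next_state(s, a, n, rng)      # hypothetical sampler
            P_hat[s, a] = np.bincount(nxt, minlength=S) / n
    Q = np.zeros((S, A))
    for _ in range(k):
        Q = entropic_backup(P_hat, r, Q, beta, gamma)
    return Q, Q.argmax(axis=1)   # Q_k and its greedy policy pi_k
```

In this sketch the sample budget is `n * S * A` calls to the generative model, which is the quantity the PAC bounds in the abstract control. The sign of `beta` selects risk-averse or risk-seeking behaviour, depending on the sign convention adopted.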

Paper Structure

This paper contains 34 sections, 16 theorems, 122 equations, 1 figure, 2 algorithms.

Key Result

Lemma 1

Fix a map $\pi:\mathcal{S}\rightarrow \mathcal{A}$. We then define the operators $\mathcal{T}^\pi,\mathcal{T}:\mathbb{R}^{S\times A}\rightarrow \mathbb{R}^{S\times A}$, which for $f:\mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}$ are given by ... The operators $\mathcal{T}$ and $\mathcal{T}^\pi$ are $\gamma$-contractions with respect to the max-norm, i.e., for value-functions $f_1$ and $f_2$, it holds that $\|\mathcal{T}f_1-\mathcal{T}f_2\|_\infty\leq\gamma\|f_1-f_2\|_\infty$ and $\|\mathcal{T}^\pi f_1-\mathcal{T}^\pi f_2\|_\infty\leq\gamma\|f_1-f_2\|_\infty$.
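For orientation, one commonly used form of recursive entropic-risk Bellman operators is sketched below. It is an assumption rather than a quotation of the paper's definition, which may differ (for instance in where $\gamma$ or the reward enters). The symbols $r$ (reward function) and $P$ (transition kernel) are assumed names.

```latex
\begin{align*}
(\mathcal{T}^\pi f)(s,a) &= r(s,a) + \frac{\gamma}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'\mid s,a)\, e^{\beta f(s',\pi(s'))}, \\
(\mathcal{T} f)(s,a) &= r(s,a) + \frac{\gamma}{\beta}\log \sum_{s'\in\mathcal{S}} P(s'\mid s,a)\, e^{\beta \max_{a'\in\mathcal{A}} f(s',a')}.
\end{align*}
```

Under this form, the contraction property follows from a short argument: the map $v\mapsto\frac{1}{\beta}\log\sum_{s'}P(s'\mid s,a)\,e^{\beta v(s')}$ is monotone and commutes with adding constants, so it is non-expansive in the max-norm, and the factor $\gamma$ then gives a $\gamma$-contraction.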

Figures (1)

  • Figure 1: Dynamics and rewards of the hard-to-learn MDP class

Theorems & Definitions (33)

  • Definition 1: $(\varepsilon,\delta)$-correctness
  • Lemma 1: Q-value iteration
  • Lemma 2
  • Lemma 3: Simulation Lemma with Entropic Risk
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • ...and 23 more