Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Oliver Mortensen; Mohammad Sadegh Talebi

TL;DR

This work studies the sample complexity of learning the optimal $Q$-function and an optimal policy in finite discounted MDPs under recursive entropic risk with parameter $\beta\neq0$, assuming access to a generative model. It introduces model-based risk-sensitive Q-value iteration (MB-RS-QVI) and derives $(\varepsilon,\delta)$-PAC bounds for both $Q^*$-approximation and the performance of the resulting greedy policy. Both bounds depend exponentially on the effective horizon $\frac{1}{1-\gamma}$, and this dependence grows with $|\beta|$. The paper further proves lower bounds showing that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable and that the upper bounds are tight in $S, A, \delta, \varepsilon$, which marks a fundamental hardness gap relative to risk-neutral RL. These results establish the first concrete sample-complexity bounds for discounted, entropic-risk RL and motivate future work on model-free approaches and function approximation in risk-sensitive settings. Overall, the findings quantify how hard risk-sensitive decision-making becomes in long-horizon tasks.

Abstract

In this paper, we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter $\beta\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive $Q$-value iteration (MB-RS-QVI), which leads to $(\varepsilon,\delta)$-PAC bounds on $\|Q^*-Q_k\|$ and $\|V^*-V^{\pi_k}\|$, where $Q_k$ is the output of MB-RS-QVI after $k$ iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$, and the strength of this dependence grows with the learner's risk sensitivity $|\beta|$. We also provide two lower bounds which show that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC bounds are tight in the parameters $S,A,\delta,\varepsilon$ and that, unlike in the classical setting, it is not possible to have polynomial dependence in all model parameters.
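To make the model-based pipeline concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the commonly used recursive entropic backup $Q(s,a) = r(s,a) + \frac{\gamma}{\beta}\log \mathbb{E}_{s'\sim P(\cdot|s,a)}\left[e^{\beta \max_{a'} Q(s',a')}\right]$, a known deterministic reward table, and a fixed number `n` of generative-model samples per state-action pair; the abstract does not spell out any of these details. The function names (`entropic_backup`, `estimate_model`, `rs_qvi`) and the toy MDP are hypothetical and chosen only for illustration.

```python
import numpy as np


def entropic_backup(P, V, beta):
    """Entropic risk (1/beta) * log E[exp(beta * V(s'))] for every (s, a).

    P has shape (S, A, S) and V has shape (S,). The exponent is shifted by
    its maximum before exponentiating to reduce the chance of overflow.
    """
    z = beta * V
    shift = z.max()
    return (shift + np.log(P @ np.exp(z - shift))) / beta


def estimate_model(P_true, n, rng):
    """Empirical kernel built from n generative-model samples per (s, a)."""
    S, A, _ = P_true.shape
    P_hat = np.empty_like(P_true)
    for s in range(S):
        for a in range(A):
            # Next-state counts from n independent draws, normalized.
            P_hat[s, a] = rng.multinomial(n, P_true[s, a]) / n
    return P_hat


def rs_qvi(P, R, gamma, beta, k):
    """Run k sweeps of the assumed risk-sensitive backup on the model (P, R)."""
    Q = np.zeros_like(R)
    for _ in range(k):
        V = Q.max(axis=1)
        Q = R + gamma * entropic_backup(P, V, beta)
    # Return the final iterate and the greedy policy with respect to it.
    return Q, Q.argmax(axis=1)


# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
P_true = np.array([[[0.9, 0.1], [0.5, 0.5]],
                   [[0.2, 0.8], [0.6, 0.4]]])
R = np.array([[0.0, 0.3], [1.0, 0.2]])
rng = np.random.default_rng(0)
P_hat = estimate_model(P_true, n=2000, rng=rng)
Q_k, pi_k = rs_qvi(P_hat, R, gamma=0.9, beta=-1.0, k=200)
```

In this sketch a negative `beta` gives a risk-averse backup (at most the expected next-state value) and a positive `beta` a risk-seeking one. The factor $e^{\beta V}$ inside the expectation gives some intuition for why $|\beta|\frac{1}{1-\gamma}$ can enter the bounds exponentially, since values can be as large as $\frac{1}{1-\gamma}$ when rewards lie in $[0,1]$.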
