Risk-sensitive Markov Decision Process and Learning under General Utility Functions

Zhengqi Wu, Renyuan Xu

TL;DR

The paper addresses learning in risk-sensitive Markov decision processes with general utility functions by augmenting the state space with the cumulative reward, which restores a dynamic programming framework. It introduces a discretized enlarged environment via an ε_o-covering of the cumulative-reward dimension, and proves near-Lipschitz properties of the optimal value function to control the discretization error. Two algorithms are developed: VIGU, which uses simulator access to learn a near-optimal policy with a sample complexity whose dependence on the model parameters is roughly $\tilde{O}(H^7 S A \kappa^2)$; and VIGU-UCB, which works without a simulator and achieves regret bounds that scale polynomially with the model parameters and match the lower bound up to a small polynomial factor. A novel regret lower bound for risk-sensitive RL with general utilities is established, showing that the proposed methods are nearly optimal in their dependence on the action set and time horizon, with the remaining gaps due to discretization and Lipschitz constants. Overall, the work provides finite-sample and regret guarantees for RL under general utility functions, offering a principled path for risk-aware sequential decision making.
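
As a hedged illustration of the state augmentation described above, the recursion below is the standard finite-horizon backward induction written on the pair (state, cumulative reward). The notation ($V_h$, $y$, $r_h$, $P_h$) and the assumption of deterministic rewards are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
V_{H+1}(s, y) &= U(y), \\
V_h(s, y) &= \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}}
  P_h(s' \mid s, a)\, V_{h+1}\bigl(s',\; y + r_h(s, a)\bigr),
  \qquad h = H, H-1, \dots, 1.
\end{aligned}
```

Under these assumptions, $V_1(s_1, 0)$ is the best expected utility of the total reward achievable from the initial state $s_1$ by a policy that observes the accumulated reward $y$. Because the utility is applied only at the end of the horizon, $y$ has to be carried along as part of the state.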

Abstract

Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. Existing literature on RL theory largely focuses on risk-neutral settings where the decision-maker learns to maximize the expected cumulative reward. However, in practical scenarios such as portfolio management and e-commerce recommendations, decision-makers often hold heterogeneous risk preferences under outcome uncertainty, which cannot be well captured by the risk-neutral framework. Incorporating these preferences can be approached through utility theory, yet the development of risk-sensitive RL under general utility functions remains an open question for theoretical exploration. In this paper, we consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). To facilitate the Dynamic Programming Principle and Bellman equation, we enlarge the state space with an additional dimension that accounts for the cumulative reward. We propose a discretized approximation scheme to the MDP under the enlarged state space, which is tractable and key for algorithmic design. We then propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward. When a simulator is accessible, our algorithm efficiently learns a near-optimal policy with guaranteed sample complexity. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy while ensuring a guaranteed regret bound. Finally, we establish a novel theoretical regret lower bound for the risk-sensitive setting, and show that the regret of our algorithm matches this lower bound up to a small polynomial factor.
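
The enlarged state space and the covering of the cumulative reward can be sketched in a few lines of Python. This is a minimal planning sketch under stated assumptions, not the paper's VIGU or VIGU-UCB: it assumes the transition model is known (so there is no sampling or exploration bonus), rewards are deterministic and lie in [0, 1], and the cumulative reward is projected onto a uniform grid of spacing `eps`. The function names and the toy MDP are hypothetical.

```python
import math


def risk_sensitive_value_iteration(P, R, U, H, eps):
    """Backward induction on the enlarged state (s, y).

    P[s][a][s2]: transition probability (assumed known in this sketch).
    R[s][a]:     deterministic reward in [0, 1] (assumption of this sketch).
    U:           utility applied to the cumulative reward at the horizon.
    H, eps:      horizon and grid spacing for the cumulative reward.
    """
    S, A = len(R), len(R[0])
    Y = int(round(H / eps)) + 1            # grid points covering [0, H]
    grid = [eps * j for j in range(Y)]

    # Terminal value: utility of the accumulated reward.
    V = [[U(y) for y in grid] for _ in range(S)]
    policy = []
    for _ in range(H):                     # steps H-1, ..., 0
        new_V = [[0.0] * Y for _ in range(S)]
        pi_h = [[0] * Y for _ in range(S)]
        for s in range(S):
            for j, y in enumerate(grid):
                best_a, best_q = 0, -math.inf
                for a in range(A):
                    # Project the updated cumulative reward onto the grid.
                    j_next = min(Y - 1, max(0, round((y + R[s][a]) / eps)))
                    q = sum(P[s][a][s2] * V[s2][j_next] for s2 in range(S))
                    if q > best_q:
                        best_a, best_q = a, q
                new_V[s][j] = best_q
                pi_h[s][j] = best_a
        V = new_V
        policy.insert(0, pi_h)             # policy[h][s][j] is the action at step h
    return V, policy, grid


def toy_mdp():
    """Hypothetical 4-state MDP: 0 = start, 1 = good, 2 = bad, 3 = safe."""
    def go_to(s):
        return [1.0 if t == s else 0.0 for t in range(4)]

    P = [
        [go_to(3), [0.0, 0.5, 0.5, 0.0]],  # action 0 is safe, action 1 is risky
        [go_to(1), go_to(1)],
        [go_to(2), go_to(2)],
        [go_to(3), go_to(3)],
    ]
    R = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
    return P, R


if __name__ == "__main__":
    P, R = toy_mdp()
    for name, U in [("sqrt (risk-averse)", math.sqrt),
                    ("square (risk-seeking)", lambda y: y * y)]:
        V, policy, _ = risk_sensitive_value_iteration(P, R, U, H=2, eps=0.5)
        print(name, "-> first action:", policy[0][0][0], "value:", V[0][0])
```

In the toy MDP both actions yield the same expected total reward over two steps, so a risk-neutral learner would be indifferent between them. The concave utility selects the safe action and the convex one selects the risky action, which is the kind of preference the utility-based formulation is meant to capture.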

Paper Structure

This paper contains 27 sections, 23 theorems, 127 equations, 2 algorithms.

Key Result

Theorem 1

Let $\pi \in \Pi$ be a Markovian policy for $\texttt{RS-MDP}(\mathcal{S}, \mathcal{A}, H, \mathbb{P}^S, R, U)$ on the enlarged state space. Under the condition that $U$ is continuous and strictly increasing, the following results hold:

Theorems & Definitions (50)

  • Definition 1: Enlarged State Space and Transition Kernel
  • Definition 2: Markovian Policy, Value Function and Bellman Operator
  • Theorem 1: Bellman Optimality Conditions and Markovian Policy Equivalence
  • Proposition 1: Structural Properties of RS-MDP
  • Theorem 2
  • Remark 1: Remark on the Lipschitz theorem (thm:Lipschitz)
  • Proposition 2
  • Proposition 3: "Near-Lipschitz" Property of a Near-optimal Policy
  • Proposition 4
  • Definition 3: $\epsilon$-optimal Markovian Policy
  • ...and 40 more