Planning and Learning in Average Risk-aware MDPs

Weikai Wang, Erick Delage

TL;DR

The paper addresses planning and learning for average-cost MDPs under dynamic risk measures, introducing a general risk-aware relative value iteration (RVI) algorithm and two model-free Q-learning approaches. The first builds on a multi-level Monte Carlo (MLMC) estimator that removes the bias of sample-based risk estimates, which makes risk-aware Q-learning possible for a broad class of risk measures. The second is an off-policy Q-learning algorithm dedicated to utility-based shortfall risk (UBSR) measures. Theoretical results guarantee convergence for the planning algorithm and for MLMC-based learning, while experiments validate convergence, compare the two learning methods, and show that policies can be tuned to reflect specific risk preferences across real-world tasks. This work extends risk-sensitive reinforcement learning to average-cost settings and provides practical, convergent tools for identifying risk-aware policies in complex environments.
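To make the MLMC idea concrete, here is a minimal sketch of a randomized multi-level estimator in the style commonly used to debias plug-in estimates of nonlinear functionals. It is not the paper's algorithm: the function names (`entropic_risk`, `mlmc_estimate`), the geometric parameter `r = 0.6`, and the single-sample base level are all illustrative assumptions.

```python
import math
import random

def entropic_risk(samples, beta=1.0):
    """Plug-in entropic risk (1/beta) * log(mean(exp(beta * x))) of a sample."""
    m = max(samples)  # shift by the max for numerical stability
    mean_exp = sum(math.exp(beta * (x - m)) for x in samples) / len(samples)
    return m + math.log(mean_exp) / beta

def mlmc_estimate(sampler, risk, r=0.6, rng=random):
    """One draw of a randomized multi-level estimate of risk(distribution).

    sampler: zero-argument function returning one i.i.d. sample (assumed).
    risk:    plug-in risk estimator taking a list of samples (assumed).
    """
    # Random level N with P(N = n) = r * (1 - r)**n for n = 0, 1, 2, ...
    n = 0
    while rng.random() > r:
        n += 1
    p_n = r * (1.0 - r) ** n
    xs = [sampler() for _ in range(2 ** (n + 1))]
    # Correction: full-sample estimate minus the average of the two
    # half-sample estimates (even- and odd-indexed samples).
    delta = risk(xs) - 0.5 * (risk(xs[0::2]) + risk(xs[1::2]))
    # Single-sample base term plus the importance-weighted correction; in
    # expectation the corrections telescope toward the large-sample limit.
    return risk(xs[:1]) + delta / p_n
```

Averaging many independent draws of `mlmc_estimate(...)` gives the kind of estimate that could be plugged into a stochastic-approximation update. Each draw uses a random number of samples, so cost and variance both depend on the choice of `r`.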

Abstract

For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
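The planning step can be illustrated with a small sketch of relative value iteration in which the expectation over next states is replaced by a one-step risk measure. The sketch below makes several assumptions that are not taken from the paper: it uses the entropic risk measure, re-centres on a fixed reference state `ref`, and represents a finite MDP with hypothetical nested lists `cost[x][a]` and `P[x][a]`.

```python
import math

def entropic(values, probs, beta):
    """Entropic risk (1/beta) * log E[exp(beta * V)] of a discrete variable."""
    m = max(values)  # shift by the max for numerical stability
    s = sum(p * math.exp(beta * (v - m)) for v, p in zip(values, probs))
    return m + math.log(s) / beta

def risk_aware_rvi(cost, P, beta=0.5, ref=0, tol=1e-10, max_iter=10_000):
    """cost[x][a]: one-step cost; P[x][a]: next-state probabilities."""
    n_states = len(cost)
    h = [0.0] * n_states
    for _ in range(max_iter):
        # Risk-aware Bellman operator: (T h)(x) = min_a { c(x,a) + R(h | x,a) }.
        Th = [min(cost[x][a] + entropic(h, P[x][a], beta)
                  for a in range(len(cost[x])))
              for x in range(n_states)]
        g = Th[ref]                    # running estimate of the average risk
        h_new = [v - g for v in Th]    # re-centre so that h(ref) = 0
        done = max(abs(u - v) for u, v in zip(h_new, h)) < tol
        h = h_new
        if done:
            break
    # Greedy policy with respect to the final relative value function.
    policy = [min(range(len(cost[x])),
                  key=lambda a: cost[x][a] + entropic(h, P[x][a], beta))
              for x in range(n_states)]
    return g, h, policy

# Hypothetical two-state, two-action example with fully mixing transitions.
cost = [[1.0, 2.0], [3.0, 0.5]]
P = [[[0.5, 0.5], [0.9, 0.1]],
     [[0.5, 0.5], [0.2, 0.8]]]
g, h, policy = risk_aware_rvi(cost, P)
```

Subtracting the value at a reference state keeps the iterates bounded. This relies on the risk measure being translation invariant, which holds for the entropic measure used here.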

Paper Structure

This paper contains 33 sections, 26 theorems, 102 equations, 7 figures, 3 tables, 5 algorithms.

Key Result

Theorem 2.13

Under the Doeblin-type assumption (Assumption assump-AROE-Doeblin), there exist a unique $g^* \in \mathbb{R}$ and an $h^* \in \mathcal{L}(\mathcal{X})$ satisfying the average risk optimality equation (AROE): $$g^* + h^*(x) = \min_{a} \left\{ c(x,a) + \mathcal{R}(h^*|x,a) \right\}, \quad \forall x \in \mathcal{X}.$$ Moreover, $g^* = J^* = J_\infty(\bm{\pi}^*)$ for the stationary deterministic policy $\pi_t^*(a|x)=\bm{1}\{a=a^*(x)\}$, where $a^*(x)$ minimizes $c(x,a) + \mathcal{R}(h^*|x,a)$, and $g^*$ is independent of the initial state $x_0$.
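As a sanity check on this statement, consider the risk-neutral special case. The AROE is written here in its usual form, inferred from the definition of $a^*(x)$ rather than copied from the paper, and the transition kernel $P(x'|x,a)$ is notation assumed for illustration. Taking $\mathcal{R}$ to be the conditional expectation recovers the classical average cost optimality equation:

```latex
% AROE in its usual form (assumed):
g^* + h^*(x) = \min_{a} \left\{ c(x,a) + \mathcal{R}(h^* \mid x,a) \right\}
% Risk-neutral choice of the one-step risk measure:
\mathcal{R}(h \mid x,a) = \sum_{x'} P(x' \mid x,a)\, h(x')
% Resulting classical average cost optimality equation:
g^* + h^*(x) = \min_{a} \Big\{ c(x,a) + \sum_{x'} P(x' \mid x,a)\, h^*(x') \Big\}
```

In both cases $h^*$ is determined only up to an additive constant when $\mathcal{R}$ is translation invariant, which is consistent with the theorem asserting uniqueness for $g^*$ but only existence for $h^*$.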

Figures (7)

  • Figure 5.1: Convergence experiments for risk-aware RVI (Algorithm algo-RVI-generalR) and MLMC Q-learning (Algorithm algo-RVIQ-generalR).
  • Figure 5.2: Comparison of MLMC and UBSR Q-learning with equivalent number of samples.
  • Figure C.1: Convergence of the synchronous UBSR Q-learning algorithm (Algorithm algo-RVIQ-UBSR) for polynomial mixed utility and soft quantile.
  • Figure C.2: Convergence of the asynchronous UBSR Q-learning algorithm (Algorithm algo-RVIQ-UBSR-Asynchronous) for polynomial mixed utility and soft quantile.
  • Figure C.3: Convergence rate of synchronous UBSR Q-learning for expectile under different $\tau$.
  • ...and 2 more figures

Theorems & Definitions (62)

  • Definition 2.1
  • Definition 2.2: Definition 4.112, Follmer2016
  • Example 2.3: Expected value
  • Example 2.4: Entropic risk measure
  • Example 2.5: Expectile
  • Definition 2.6: Definition 2.1, BenTal2007mafi
  • Definition 2.7: Definition 3.1, Acerbi2002jbf
  • Example 2.8: Conditional Value-at-Risk
  • Example 2.9: Mean-CVaR
  • Definition 2.10
  • ...and 52 more
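The example risk measures named in this list can be sketched for a discrete cost distribution as follows. These are textbook formulations and may differ from the paper's conventions. In particular, the sketch treats `alpha` as the CVaR confidence level on the upper (high-cost) tail and `eta` as the weight on the mean in Mean-CVaR, and every function name is illustrative.

```python
import math

def expected_value(v, p):
    """Mean of a discrete distribution with values v and probabilities p."""
    return sum(pi * vi for vi, pi in zip(v, p))

def entropic(v, p, beta):
    """Entropic risk (1/beta) * log E[exp(beta * X)], for beta > 0."""
    m = max(v)  # shift by the max for numerical stability
    return m + math.log(sum(pi * math.exp(beta * (vi - m))
                            for vi, pi in zip(v, p))) / beta

def expectile(v, p, tau, iters=200):
    """Root e of tau * E[(X - e)^+] = (1 - tau) * E[(e - X)^+], by bisection."""
    lo, hi = min(v), max(v)
    for _ in range(iters):
        e = 0.5 * (lo + hi)
        gap = sum(pi * (tau * max(vi - e, 0.0) - (1 - tau) * max(e - vi, 0.0))
                  for vi, pi in zip(v, p))
        if gap > 0:
            lo = e   # e is too small: the upside term still dominates
        else:
            hi = e
    return 0.5 * (lo + hi)

def cvar(v, p, alpha):
    """CVaR via min_t { t + E[(X - t)^+] / (1 - alpha) }, for 0 <= alpha < 1.

    For a discrete distribution the minimum is attained at a support point.
    """
    return min(t + sum(pi * max(vi - t, 0.0) for vi, pi in zip(v, p)) / (1 - alpha)
               for t in v)

def mean_cvar(v, p, alpha, eta):
    """Convex combination eta * E[X] + (1 - eta) * CVaR_alpha(X)."""
    return eta * expected_value(v, p) + (1 - eta) * cvar(v, p, alpha)
```

For a fair coin flip between costs 0 and 10, the mean is 5 and `cvar` at `alpha = 0.5` is 10, which shows how the tail-focused measures penalize the bad outcome more heavily than the expectation does.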