Risk-Averse Total-Reward Reinforcement Learning

Xihong Su, Jia Lin Hau, Gersi Doko, Kishan Panaganti, Marek Petrik

TL;DR

This work develops model-free Q-learning algorithms for risk-averse total-reward MDPs under the entropic risk measure ${\mathrm{ERM}}_{\beta}$ and entropic value-at-risk ${\mathrm{EVaR}}_{\alpha}$. It introduces an ERM-based Bellman operator enabled by the elicitability of ERM to support stochastic-gradient updates, and reduces EVaR-TRC to a finite set of ERM-TRC problems to obtain a $\delta$-optimal policy. The authors prove almost-sure convergence of both the ERM-TRC Q-learning and the EVaR-TRC Q-learning under mild assumptions and validate their approach on tabular cliff-walking and gambler's ruin tasks, showing alignment with LP baselines as the sample size grows. This work provides a principled, scalable foundation for risk-averse reinforcement learning in undiscounted settings and motivates future development of function-approximation methods with convergence guarantees.

Abstract

Risk-averse total-reward Markov Decision Processes (MDPs) offer a promising framework for modeling and solving undiscounted infinite-horizon objectives. Existing model-based algorithms for risk measures like the entropic risk measure (ERM) and entropic value-at-risk (EVaR) are effective in small problems, but require full access to transition probabilities. We propose a Q-learning algorithm to compute the optimal stationary policy for total-reward ERM and EVaR objectives with strong convergence and performance guarantees. The algorithm and its optimality are made possible by ERM's dynamic consistency and elicitability. Our numerical results on tabular domains demonstrate quick and reliable convergence of the proposed Q-learning algorithm to the optimal risk-averse value function.
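The role of elicitability can be illustrated on a single random return, outside of any MDP. The sketch below is not the paper's algorithm: it assumes the standard definition $\mathrm{ERM}_{\beta}[X] = -\beta^{-1}\log \mathbb{E}[\exp(-\beta X)]$, one common exponential loss whose minimizer is the ERM, and the convention $\mathrm{EVaR}_{\alpha}[X] = \sup_{\beta>0} \mathrm{ERM}_{\beta}[X] + \beta^{-1}\log\alpha$ (some sources use $1-\alpha$ instead). The function names, the three-outcome distribution, and the $\beta$ grid are all hypothetical.

```python
import math

def erm(values, probs, beta):
    """Closed-form ERM_beta[X] = -(1/beta) * log E[exp(-beta * X)], beta > 0."""
    exps = [-beta * v for v in values]
    m = max(exps)  # log-sum-exp shift for numerical stability
    s = sum(p * math.exp(e - m) for e, p in zip(exps, probs))
    return -(m + math.log(s)) / beta

def erm_by_elicitation(values, probs, beta, step=0.1, iters=2000):
    """Gradient descent on y -> E[(exp(-beta*(X - y)) - 1)/beta + (X - y)].

    The derivative in y is E[exp(-beta*(X - y))] - 1, which vanishes exactly
    at y = ERM_beta[X]. A sample-based method would replace the expectation
    with a single draw of X; the exact expectation keeps this sketch
    deterministic.
    """
    y = 0.0
    for _ in range(iters):
        g = sum(p * math.exp(-beta * (v - y)) for v, p in zip(values, probs)) - 1.0
        y -= step * g
    return y

def evar_on_grid(values, probs, alpha, betas):
    """Assumed convention: EVaR_alpha[X] = sup_beta ERM_beta[X] + log(alpha)/beta,
    with the supremum approximated by a maximum over a finite grid of betas."""
    return max(erm(values, probs, b) + math.log(alpha) / b for b in betas)

# Hypothetical three-outcome return distribution.
values = [-2.0, 0.0, 1.0]
probs = [0.2, 0.5, 0.3]
betas = [0.05 * 1.2 ** k for k in range(60)]  # illustrative geometric grid

mean = sum(p * v for v, p in zip(values, probs))
erm_exact = erm(values, probs, beta=1.0)
erm_gd = erm_by_elicitation(values, probs, beta=1.0)
evar_02 = evar_on_grid(values, probs, 0.2, betas)
evar_06 = evar_on_grid(values, probs, 0.6, betas)
```

Under these assumptions the gradient iterate matches the closed-form ERM, and the grid maximum is a lower approximation of the EVaR supremum, mirroring on a toy scale the idea of replacing one EVaR problem by finitely many ERM problems.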

Paper Structure

This paper contains 29 sections, 18 theorems, 90 equations, 6 figures, 3 algorithms.

Key Result

Lemma 2.1

Suppose that $f\colon \mathbb{R} \rightarrow \mathbb{R}$ is a differentiable $\mu$-strongly convex function with an $L$-Lipschitz continuous gradient. Consider $x_i \in \mathbb{R}$ and a gradient update for any step size $\xi \in (0,1/L]$:

$$x_{i+1} = x_i - \xi f'(x_i).$$

Then $\exists\, l \in [1/L,1/\mu]$ such that $\xi / l \in (0,1]$ and

$$x_{i+1} - x^{\star} = (1 - \xi/l)\,(x_i - x^{\star}),$$

where $x^{\star} = \mathop{\mathrm{arg\,min}}\limits_{x \in \mathbb{R}} f(x)$ is unique.
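A quick numerical check can make the lemma concrete. The sketch below assumes the gradient update is the usual step $x_{i+1} = x_i - \xi f'(x_i)$ and reads the lemma's conclusion as the contraction identity $x_{i+1} - x^{\star} = (1 - \xi/l)(x_i - x^{\star})$. The test function $f(x) = x^2 + \log\cosh x$ is a hypothetical choice, not taken from the paper: its second derivative $2 + \mathrm{sech}^2(x)$ lies in $(2, 3]$, so $\mu = 2$, $L = 3$, and $x^{\star} = 0$.

```python
import math

MU, L = 2.0, 3.0   # f''(x) = 2 + sech(x)**2 lies in (2, 3]
X_STAR = 0.0       # unique minimizer of f(x) = x**2 + log(cosh(x))

def grad(x):
    """Derivative of the illustrative function f(x) = x**2 + log(cosh(x))."""
    return 2.0 * x + math.tanh(x)

def implied_l_values(x0, xi, steps):
    """Run gradient steps and recover the l implied by each step.

    Solving x_next - x* = (1 - xi / l) * (x - x*) for l gives
    l = xi / (1 - ratio), where ratio = (x_next - x*) / (x - x*).
    """
    x, ls = x0, []
    for _ in range(steps):
        if x == X_STAR:  # already at the minimizer; the ratio is undefined
            break
        x_next = x - xi * grad(x)
        ratio = (x_next - X_STAR) / (x - X_STAR)
        ls.append(xi / (1.0 - ratio))
        x = x_next
    return x, ls

xi = 0.2  # any step size in (0, 1/L] = (0, 1/3]
x_final, ls = implied_l_values(x0=1.5, xi=xi, steps=10)
```

For this function each recovered `l` equals $1/(2 + \tanh(x_i)/x_i)$, which stays inside $[1/L, 1/\mu]$, so every step shrinks the distance to $x^{\star}$ by a factor in $[0, 1)$.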

Figures (6)

  • Figure 1: EVaR optimal policy with $\alpha=0.2$
  • Figure 2: EVaR optimal policy with $\alpha=0.6$
  • Figure 3: Comparison of return distributions of two optimal EVaR policies.
  • Figure 4: Mean and standard deviation of EVaR values with $\alpha = 0.2$ on CW domain
  • Figure 5: EVaR value converges on CW
  • ...and 1 more figure

Theorems & Definitions (31)

  • Lemma 2.1: Lemma C.13 in hau2024q
  • Theorem 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Remark 1
  • Remark 2
  • Lemma 3.4
  • Theorem 3.5: Theorem 4.2 in su2024stationary
  • Remark 3
  • Theorem 4.2
  • ...and 21 more