Risk-Averse Total-Reward Reinforcement Learning

Xihong Su; Jia Lin Hau; Gersi Doko; Kishan Panaganti; Marek Petrik

TL;DR

This work develops model-free Q-learning algorithms for risk-averse total-reward MDPs under the entropic risk measure ${\mathrm{ERM}}_{\beta}$ and the entropic value-at-risk ${\mathrm{EVaR}}_{\alpha}$. It introduces an ERM-based Bellman operator whose stochastic-gradient updates are made possible by the elicitability of ERM, and it reduces the EVaR problem under the total reward criterion (EVaR-TRC) to a finite set of ERM-TRC problems to obtain a $\delta$-optimal policy. The authors prove almost-sure convergence of both the ERM-TRC and EVaR-TRC Q-learning algorithms under mild assumptions and validate the approach on tabular cliff-walking and gambler's ruin tasks, where the learned solutions align with linear-programming (LP) baselines as the sample size grows. The work provides a principled foundation for risk-averse reinforcement learning in undiscounted settings and motivates future function-approximation methods with convergence guarantees.
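To make the two risk measures concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the common reward-based convention ${\mathrm{ERM}}_{\beta}[X] = -\beta^{-1} \log \mathbb{E}[e^{-\beta X}]$ and ${\mathrm{EVaR}}_{\alpha}[X] = \sup_{\beta > 0} \left( {\mathrm{ERM}}_{\beta}[X] + \beta^{-1} \log \alpha \right)$, and it approximates the supremum on a finite grid of $\beta$ values, which mirrors the idea of solving finitely many ERM problems. The function names, the sample of returns, and the geometric grid are all hypothetical choices made for illustration.

```python
import numpy as np

def erm(samples, beta):
    """Entropic risk of a reward sample: -(1/beta) * log E[exp(-beta * X)]."""
    x = np.asarray(samples, dtype=float)
    if beta == 0.0:
        return float(x.mean())  # ERM reduces to the expectation as beta -> 0
    z = -beta * x
    m = z.max()  # shift by the maximum for a numerically stable log-mean-exp
    return float(-(m + np.log(np.mean(np.exp(z - m)))) / beta)

def evar_on_grid(samples, alpha, betas):
    """Approximate EVaR by maximizing ERM_beta + log(alpha)/beta over a finite grid."""
    values = [erm(samples, b) + np.log(alpha) / b for b in betas]
    best = int(np.argmax(values))
    return values[best], betas[best]

# Hypothetical equally likely total returns of some fixed policy.
returns = np.array([-10.0, 0.0, 0.0, 5.0])
# Hypothetical grid; this is an assumption, not the paper's discretization.
betas = np.geomspace(1e-3, 10.0, 200)
value, beta_star = evar_on_grid(returns, alpha=0.5, betas=betas)
print(f"EVaR estimate {value:.3f} attained at beta = {beta_star:.3f}")
```

In this sketch each grid point corresponds to one ERM evaluation; in the reinforcement-learning setting each grid point would instead correspond to one ERM-TRC problem to be solved.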

Abstract

Risk-averse total-reward Markov Decision Processes (MDPs) offer a promising framework for modeling and solving undiscounted infinite-horizon objectives. Existing model-based algorithms for risk measures like the entropic risk measure (ERM) and entropic value-at-risk (EVaR) are effective in small problems, but require full access to transition probabilities. We propose a Q-learning algorithm to compute the optimal stationary policy for total-reward ERM and EVaR objectives with strong convergence and performance guarantees. The algorithm and its optimality are made possible by ERM's dynamic consistency and elicitability. Our numerical results on tabular domains demonstrate quick and reliable convergence of the proposed Q-learning algorithm to the optimal risk-averse value function.
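The role of elicitability can be illustrated with a small sketch: ERM is the minimizer of an expected exponential loss, so it can be estimated from samples by stochastic gradient descent without knowing the distribution. The code below assumes the loss $\ell(y) = \mathbb{E}\left[\beta^{-1}(e^{-\beta (X - y)} - 1) + (X - y)\right]$, whose minimizer is ${\mathrm{ERM}}_{\beta}[X] = -\beta^{-1} \log \mathbb{E}[e^{-\beta X}]$. It is a one-dimensional illustration under these assumptions, not the paper's Q-learning update; the step-size schedule, the outcome distribution, and all function names are hypothetical.

```python
import numpy as np

def erm_closed_form(x, beta):
    """ERM of equally likely outcomes: -(1/beta) * log mean(exp(-beta * x))."""
    x = np.asarray(x, dtype=float)
    return float(-np.log(np.mean(np.exp(-beta * x))) / beta)

def erm_loss_grad(y, x, beta):
    """Derivative in y of (exp(-beta * (x - y)) - 1) / beta + (x - y)."""
    return np.exp(-beta * (x - y)) - 1.0

def erm_sgd(sample_fn, beta, steps, rng):
    """Estimate ERM from samples by stochastic gradient descent on the loss."""
    y = 0.0
    for t in range(steps):
        x = sample_fn(rng)
        step = 5.0 / (t + 10.0)  # hypothetical decaying step size
        y -= step * erm_loss_grad(y, x, beta)
    return y

outcomes = np.array([-10.0, 0.0, 0.0, 5.0])  # hypothetical equally likely returns
beta = 0.2
rng = np.random.default_rng(0)
estimate = erm_sgd(lambda r: r.choice(outcomes), beta, 50_000, rng)
exact = erm_closed_form(outcomes, beta)
print(f"SGD estimate {estimate:.3f}, closed form {exact:.3f}")
```

The sampled update only needs draws of $X$, which is the property that makes a model-free, Q-learning-style scheme plausible for ERM objectives.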
