Planning and Learning in Average Risk-aware MDPs
Weikai Wang, Erick Delage
TL;DR
The paper addresses planning and learning in average-cost MDPs under dynamic risk measures. It introduces a general risk-aware relative value iteration (RVI) algorithm and two model-free Q-learning algorithms: one built on a multi-level Monte Carlo (MLMC) estimator that is unbiased for a broad class of risk measures, and an off-policy algorithm specific to utility-based shortfall risk (UBSR) measures. The RVI and MLMC-based Q-learning algorithms are proven to converge. Experiments support these results, show empirical convergence of the UBSR algorithm, compare the methods, and show that policies can be tuned to reflect specific risk preferences. The work extends risk-sensitive reinforcement learning to the average-cost setting and provides practical tools for identifying risk-aware policies.
Abstract
For continuing tasks, average-cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms: a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and the MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, empirically confirm the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
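
To make the planning idea concrete, the following is a minimal sketch of a relative value iteration loop in which the expectation over next states is replaced by a one-step risk measure. It is not the authors' algorithm as stated in the paper: the entropic risk measure, the function names, the reference-state normalization, and the two-state MDP are all assumptions chosen for illustration. Setting beta to 0 recovers the risk-neutral average-cost backup.

```python
import math

def entropic_risk(probs, values, beta):
    """One-step entropic risk (1/beta) * log E[exp(beta * V)] of a cost V.

    beta = 0 recovers the plain expectation; beta > 0 penalizes high costs.
    (Illustrative choice of risk measure, assumed for this sketch.)
    """
    if beta < 0:
        raise ValueError("this sketch only handles beta >= 0")
    if beta == 0:
        return sum(p * v for p, v in zip(probs, values))
    m = max(values)  # shift exponents to avoid overflow
    total = sum(p * math.exp(beta * (v - m)) for p, v in zip(probs, values))
    return m + math.log(total) / beta

def q_values(P, C, h, beta):
    """Risk-aware one-step backup: cost plus risk of the next-state value."""
    return [[C[s][a] + entropic_risk(P[s][a], h, beta) for a in range(len(C[s]))]
            for s in range(len(C))]

def risk_aware_rvi(P, C, beta, ref_state=0, tol=1e-10, max_iter=10_000):
    """Iterate h <- T(h) - T(h)[ref_state] until the update is below tol.

    P[s][a] is a list of next-state probabilities, C[s][a] a scalar cost.
    Returns the estimated average risk-aware cost, the relative value
    function h (with h[ref_state] = 0), and a greedy policy.
    """
    h = [0.0] * len(C)
    for _ in range(max_iter):
        Th = [min(row) for row in q_values(P, C, h, beta)]
        offset = Th[ref_state]  # subtracting it keeps the iterates bounded
        h_new = [x - offset for x in Th]
        diff = max(abs(x - y) for x, y in zip(h_new, h))
        h = h_new
        if diff < tol:
            break
    q = q_values(P, C, h, beta)
    gain = min(q[ref_state]) - h[ref_state]
    policy = [min(range(len(row)), key=row.__getitem__) for row in q]
    return gain, h, policy

# Hypothetical two-state, two-action MDP with strictly positive transitions.
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.2, 0.8], [0.6, 0.4]]]
C = [[1.0, 1.5],
     [3.0, 2.0]]

gain_neutral, h_neutral, policy_neutral = risk_aware_rvi(P, C, beta=0.0)
gain_averse, h_averse, policy_averse = risk_aware_rvi(P, C, beta=1.0)
print("risk-neutral average cost:", gain_neutral, "policy:", policy_neutral)
print("risk-averse average cost:", gain_averse, "policy:", policy_averse)
```

Because the entropic risk of a cost is never below its expectation when beta is positive, the risk-averse average cost in this sketch is at least as large as the risk-neutral one. Other one-step risk measures could be substituted for `entropic_risk` without changing the loop, provided they are monotone and translation-invariant.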
