Is Risk-Sensitive Reinforcement Learning Properly Resolved?
Ruiwen Zhou, Minghuan Liu, Kan Ren, Xufang Luo, Weinan Zhang, Dongsheng Li
TL;DR
This paper tackles risk-sensitive reinforcement learning (RSRL) by showing that standard distributional RL methods yield biased optimization of distortion-risk objectives over accumulated returns. It introduces Trajectory Q-Learning (TQL), a non-Markovian framework that models the history-return (HR) distribution via a history-conditioned value function and a history-dependent policy, with theoretical guarantees for policy evaluation and improvement under general distortion risk measures. The authors prove that conventional risk-aware Bellman operators are not contractive and can lead to biased or divergent optimization, which motivates HR-based evaluation and a trajectory-focused learning scheme. Empirically, TQL improves risk-sensitive performance on discrete MiniGrid and continuous Mountain-Car tasks and outperforms baselines, especially under higher risk levels. The work provides a general, convergent approach to risk-sensitive policy learning and outlines avenues for scaling to more complex, stochastic environments.
Abstract
Because learning applicable policies requires managing risk, risk-sensitive reinforcement learning (RSRL) has been recognized as an important direction. RSRL is usually achieved by optimizing risk-sensitive objectives characterized by various risk measures, under the framework of distributional reinforcement learning. However, it remains unclear whether the distributional Bellman operator properly optimizes the RSRL objective in the sense of risk measures. In this paper, we prove that existing RSRL methods do not achieve unbiased optimization and cannot guarantee optimality, or even improvement, with respect to risk measures over accumulated return distributions. To remedy this issue, we propose a novel algorithm, Trajectory Q-Learning (TQL), for RSRL problems with provable policy improvement towards the optimal policy. Based on our new learning architecture, we introduce a general and practical implementation that supports different risk measures and thus learns different risk-sensitive policies. In the experiments, we verify the learnability of our algorithm and show that our method achieves better performance on risk-sensitive objectives.
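To make "a risk measure over the accumulated return distribution" concrete, the following is a minimal sketch of one common distortion risk measure, conditional value at risk (CVaR), estimated from sampled episode returns. It is an illustration only, not the paper's TQL implementation; the function name `cvar`, the risk level `alpha`, and the two sample return sets are assumptions made for this example.

```python
import numpy as np

def cvar(returns, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of sampled returns.

    Hypothetical helper for illustration; alpha=1.0 recovers the plain mean
    (the risk-neutral objective), smaller alpha focuses on the lower tail.
    """
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    # Number of samples in the lower tail, at least one.
    k = max(1, int(np.ceil(alpha * sorted_returns.size)))
    return sorted_returns[:k].mean()

# Assumed accumulated returns of two hypothetical policies.
returns_safe = [1.0, 1.0, 1.0, 1.0]
returns_risky = [-2.0, 2.0, 3.0, 3.0]

for name, rets in [("safe", returns_safe), ("risky", returns_risky)]:
    print(name, "mean:", np.mean(rets), "CVaR(0.25):", cvar(rets, 0.25))
```

In this toy data the risky policy has the higher mean return but the lower CVaR at `alpha = 0.25`, so a risk-neutral objective and a risk-averse objective rank the two policies differently. This is the kind of trajectory-level objective the abstract refers to.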
