Decoupling Time and Risk: Risk-Sensitive Reinforcement Learning with General Discounting
Mehrdad Moghimi, Anthony Coache, Hyejin Ku
TL;DR
This work introduces a stock-augmented distributional RL framework that supports general discounting and risk-sensitive objectives via $F_K$ functionals (including optimized certainty equivalents, OCE). By augmenting the state with a reward stock and using non-stationary, time-consistent dynamic programming (DP), it addresses the time-preference inconsistencies that arise with non-exponential discounting. The authors provide finite-horizon backward induction guarantees, multi-horizon approximations, and a principled infinite-horizon strategy with provable error bounds, plus practical algorithms (RIGOR) tested on Goal-Based Wealth Management and Atari 2600. Key results show improved performance over time-inconsistent baselines and robust handling of risk and discounting across tasks, highlighting discounting as a central design choice for expressing temporal and risk preferences. The framework lays groundwork for richer, safer decision-making in real-world settings with complex time and risk structures.
Abstract
Distributional reinforcement learning (RL) is a powerful framework increasingly adopted in safety-critical domains for its ability to optimize risk-sensitive objectives. However, the role of the discount factor is often overlooked, as it is typically treated as a fixed parameter of the Markov decision process or as a tunable hyperparameter, with little consideration of its effect on the learned policy. In the literature, it is well known that the discounting function plays a major role in characterizing the time preferences of an agent, which an exponential discount factor cannot fully capture. Building on this insight, we propose a novel framework that supports flexible discounting of future rewards and optimization of risk measures in distributional RL. We provide a technical analysis of the optimality of our algorithms, show that our multi-horizon extension fixes issues raised with existing methodologies, and validate the robustness of our methods through extensive experiments. Our results highlight that discounting is a cornerstone in decision-making problems for capturing more expressive temporal and risk preference profiles, with potential implications for real-world safety-critical applications.
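
To make the time-inconsistency issue concrete, the following minimal sketch compares an exponential and a hyperbolic discount function on a choice between a smaller-sooner and a larger-later reward. It is an illustration only, not the paper's algorithm: the function names, the reward amounts and arrival times, and the parameters `gamma` and `k` are all assumptions chosen for the example.

```python
def exponential(delay, gamma=0.9):
    """Exponential discount weight for a reward `delay` steps away (assumed gamma)."""
    return gamma ** delay


def hyperbolic(delay, k=1.0):
    """Hyperbolic discount weight 1 / (1 + k * delay) (assumed k)."""
    return 1.0 / (1.0 + k * delay)


def prefers_later(discount, now, small=(10, 1.0), large=(12, 2.0)):
    """Return True if, evaluated at time `now`, the larger-later reward wins.

    Each reward is a hypothetical (arrival_time, amount) pair; the discount
    is applied to the delay measured from `now`.
    """
    (t_small, r_small), (t_large, r_large) = small, large
    value_small = discount(t_small - now) * r_small
    value_large = discount(t_large - now) * r_large
    return value_large > value_small


for name, discount in [("exponential", exponential), ("hyperbolic", hyperbolic)]:
    early = prefers_later(discount, now=0)   # decision made far in advance
    late = prefers_later(discount, now=10)   # decision when the small reward is due
    print(f"{name}: prefers later at t=0: {early}, at t=10: {late}")
```

Under these assumed numbers, the exponential agent ranks the two rewards the same way at both decision times, because the ratio of discount weights depends only on the gap between the rewards. The hyperbolic agent prefers the larger-later reward when planning at time 0 but switches to the smaller-sooner reward at time 10, which is the kind of preference reversal that a time-consistent formulation has to rule out.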
