Decoupling Time and Risk: Risk-Sensitive Reinforcement Learning with General Discounting

Mehrdad Moghimi, Anthony Coache, Hyejin Ku

TL;DR

This work introduces a stock-augmented distributional RL framework that supports general discounting and risk-sensitive objectives expressed through $F_K$ functionals, including optimized certainty equivalents (OCE). By augmenting the state with a reward stock and using non-stationary, time-consistent dynamic programming (DP), it addresses the time-preference inconsistencies that arise under non-exponential discounting. The authors provide finite-horizon backward induction guarantees, multi-horizon approximations, and an infinite-horizon strategy with provable error bounds, along with practical algorithms (RIGOR) tested on Goal-Based Wealth Management (GBWM) and Atari 2600. The results show improved performance over time-inconsistent baselines and robust handling of risk and discounting across tasks, which positions discounting as a central design choice for expressing temporal and risk preferences. The framework lays the groundwork for safer decision-making in real-world settings with complex time and risk structures.
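The time-preference inconsistency mentioned above can be illustrated with a small, self-contained sketch. It is not taken from the paper: the reward sizes, the delivery times, and the helper name `prefers_later` are hypothetical choices for illustration, and the discount parameters ($k = 0.01$, $\gamma = 0.99$) are borrowed from the Figure 5 caption. The sketch shows that an agent re-evaluating two fixed future rewards with a hyperbolic discount changes its ranking as time passes, whereas an exponential discounter does not.

```python
def hyperbolic(delay, k=0.01):
    """Hyperbolic discount weight for a reward `delay` steps ahead."""
    return 1.0 / (1.0 + k * delay)


def exponential(delay, gamma=0.99):
    """Exponential discount weight for a reward `delay` steps ahead."""
    return gamma ** delay


def prefers_later(discount, now, t_small, r_small, t_large, r_large):
    """Return True if, judged at time `now`, the larger-later reward
    has a higher discounted value than the smaller-sooner reward."""
    v_small = discount(t_small - now) * r_small
    v_large = discount(t_large - now) * r_large
    return v_large > v_small


# Hypothetical choice: reward 10 at t=100 versus reward 12 at t=130.
choice = dict(t_small=100, r_small=10.0, t_large=130, r_large=12.0)

# Hyperbolic agent: the ranking flips between t=0 and t=100.
hyp_at_0 = prefers_later(hyperbolic, now=0, **choice)      # True
hyp_at_100 = prefers_later(hyperbolic, now=100, **choice)  # False

# Exponential agent: the ranking is the same at both times.
exp_at_0 = prefers_later(exponential, now=0, **choice)      # False
exp_at_100 = prefers_later(exponential, now=100, **choice)  # False
```

The flip for the hyperbolic agent is the kind of reversal that a stationary policy cannot represent, which is one way to read the motivation for non-stationary, time-consistent dynamic programming.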

Abstract

Distributional reinforcement learning (RL) is a powerful framework increasingly adopted in safety-critical domains for its ability to optimize risk-sensitive objectives. However, the role of the discount factor is often overlooked, as it is typically treated as a fixed parameter of the Markov decision process or a tunable hyperparameter, with little consideration of its effect on the learned policy. In the literature, it is well known that the discounting function plays a major role in characterizing time preferences of an agent, which an exponential discount factor cannot fully capture. Building on this insight, we propose a novel framework that supports flexible discounting of future rewards and optimization of risk measures in distributional RL. We provide a technical analysis of the optimality of our algorithms, show that our multi-horizon extension fixes issues raised with existing methodologies, and validate the robustness of our methods through extensive experiments. Our results highlight that discounting is a cornerstone in decision-making problems for capturing more expressive temporal and risk preference profiles, with potential implications for real-world safety-critical applications.
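As a concrete instance of the risk measures the abstract refers to, the sketch below evaluates an optimized certainty equivalent (OCE) on a finite sample, using the standard definition $\mathrm{OCE}_u(X) = \sup_\lambda \{\lambda + \mathbb{E}[u(X - \lambda)]\}$. It is an illustration under stated assumptions, not the paper's implementation: the function names `oce` and `cvar_utility` and the sample data are hypothetical. With the utility $u(x) = \min(x, 0)/\alpha$ the OCE recovers the lower-tail CVaR at level $\alpha$, and with $u(x) = x$ it reduces to the mean.

```python
def oce(samples, utility, candidates):
    """Empirical OCE: maximise lam + mean(utility(x - lam)) over the
    candidate values of lam supplied by the caller."""
    n = len(samples)
    return max(
        lam + sum(utility(x - lam) for x in samples) / n
        for lam in candidates
    )


def cvar_utility(alpha):
    """Utility u(x) = min(x, 0) / alpha, whose OCE is the lower-tail CVaR."""
    return lambda x: min(x, 0.0) / alpha


# Hypothetical returns: ten equally likely outcomes 1, 2, ..., 10.
returns = [float(v) for v in range(1, 11)]

# For the CVaR utility on an empirical distribution, the objective is
# piecewise linear and concave in lam, so searching over the sample
# values themselves is enough to find the maximum.
cvar_20 = oce(returns, cvar_utility(0.2), candidates=returns)

# Risk-neutral case: u(x) = x makes the objective independent of lam,
# so the OCE equals the sample mean.
risk_neutral = oce(returns, lambda x: x, candidates=returns)
```

Here `cvar_20` equals the average of the two worst outcomes (1 and 2), and `risk_neutral` equals the sample mean of 5.5.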

Paper Structure

This paper contains 40 sections, 13 theorems, 80 equations, 7 figures, 4 tables, 3 algorithms.

Key Result

Lemma 3.1

Let $K$ be an objective functional satisfying the indifference to mixtures property. For any time $t < T$, let $\eta_{t+1}$ and $\eta'_{t+1}$ be two return distribution functions for time $t+1$. If $F_K \eta_{t+1} \geq F_K \eta'_{t+1}$, then for any time-$t$ policy $\pi_t$: $F_K (\mathcal{T}^d_{\pi_t} \ldots$

Figures (7)

  • Figure 1: The impact of time-consistency. Performance comparison between our proposed time-consistent framework (in blue) and the time-inconsistent approach of Fedus et al. (2019) (in red). By correctly modeling the non-stationary optimal policy required for general discounting, our method achieves higher returns in American Put Option Trading and Windy Lunar Lander.
  • Figure 2: Utility functions for common OCE risk measures. Detailed formulations are provided in the paper's appendix on risk measures.
  • Figure 3: The preference reversals in GBWM. Monte-Carlo probabilities of achieving the two goals for a risk-neutral and a risk-sensitive agent. Colors indicate the discounting approach, and shapes represent different environment settings.
  • Figure 4: Relative performance improvement of our Time-Consistent algorithm across 50 Atari games. Each bar shows the percentage improvement in mean return for the Time-Consistent policy over the Time-Inconsistent baseline, averaged over 3 seeds per game. The Time-Consistent policy outperforms in 39 out of 50 games. Across all games, it achieves a mean improvement of 39.89% and a median improvement of 18.14%, demonstrating the benefits of maintaining time-consistency under hyperbolic discounting.
  • Figure 5: From interest rates to discount functions. Comparison of exponential discounting with $\gamma = 0.99$ (orange), hyperbolic discounting with $k = 0.01$ (green), and CIR-based discount factors with parameters $a = 0.05$, $b = 0.01$, $\sigma = 0.1$, and $r_0 = 0.01$ (blue).
  • ...and 2 more figures
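The exponential and hyperbolic curves named in the Figure 5 caption can be compared with a short sketch. The parameter values $\gamma = 0.99$ and $k = 0.01$ come from the caption; the functional forms $\gamma^t$ and $1/(1 + kt)$ are the standard ones and are assumed here, and the helper `one_step_ratio` is a hypothetical name. The CIR-based curve is left out because its construction is not given in this summary.

```python
def exponential_discount(t, gamma=0.99):
    """Standard exponential discount weight gamma ** t."""
    return gamma ** t


def hyperbolic_discount(t, k=0.01):
    """Standard hyperbolic discount weight 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def one_step_ratio(discount, t):
    """Effective one-step discount factor d(t + 1) / d(t) at time t."""
    return discount(t + 1) / discount(t)


# The exponential ratio is the same at every time step (equal to gamma).
exp_ratio_early = one_step_ratio(exponential_discount, 0)
exp_ratio_late = one_step_ratio(exponential_discount, 100)

# The hyperbolic ratio starts near 0.99 and drifts toward 1 over time.
hyp_ratio_early = one_step_ratio(hyperbolic_discount, 0)
hyp_ratio_late = one_step_ratio(hyperbolic_discount, 100)

# Far in the future the hyperbolic weight is much larger than the
# exponential one, even though both start with a similar one-step ratio.
exp_far = exponential_discount(500)
hyp_far = hyperbolic_discount(500)
```

Because the effective one-step factor of the hyperbolic curve changes with $t$, the same state can warrant different actions at different times, which is consistent with the non-stationary optimal policy described in the Figure 1 caption.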

Theorems & Definitions (27)

  • Definition 2.1: Indifference to Scaling
  • Definition 2.2: Indifference to Mixtures
  • Lemma 3.1: Time-Dependent Monotonicity
  • Theorem 3.2: Optimality of Distributional Backward Induction
  • Remark 3.3: On Tie-Breaking and Uniqueness
  • Remark 3.4: On Lipschitz Continuity
  • Proposition 3.5: OCE Optimality Bound
  • Lemma 5.1: Asymptotic Risk-Neutrality
  • Theorem 5.2: Performance Bound for Approximate Infinite-Horizon Policy
  • Lemma E.1: Time-Dependent Monotonicity (Restated)
  • ...and 17 more