Is Risk-Sensitive Reinforcement Learning Properly Resolved?

Ruiwen Zhou, Minghuan Liu, Kan Ren, Xufang Luo, Weinan Zhang, Dongsheng Li

TL;DR

This paper tackles risk-sensitive reinforcement learning by showing that standard distributional RL methods optimize distortion-risk objectives over accumulated returns in a biased way. It introduces Trajectory Q-Learning (TQL), a non-Markovian framework that models the history-return (HR) distribution via a history-conditioned value function and a history-dependent policy, with theoretical guarantees for policy evaluation and improvement under general distortion risk measures. The authors prove that conventional risk-aware Bellman operators are not contractive and can lead to biased or divergent optimization, which motivates HR-based evaluation and a trajectory-focused learning scheme. Empirically, TQL improves risk-sensitive performance on discrete MiniGrid and continuous Mountain-Car tasks, outperforming baselines especially at higher risk levels. The work provides a general, convergent approach to risk-sensitive policy learning and outlines avenues for scaling to more complex, stochastic environments.
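To see why a history-dependent policy can matter for a risk measure over the accumulated return, consider the following minimal sketch. It is a hypothetical two-step example, not the paper's Figure 1 MDP: the reward values, the level `ALPHA = 0.75`, and the helper names `cvar` and `return_distribution` are all assumptions made for illustration. The first reward is random and the agent then arrives in the same state regardless of the outcome, so a Markov policy must pick one second action, whereas a history-dependent policy may condition on the reward already collected.

```python
from itertools import product

def cvar(outcomes, alpha):
    """Lower-tail CVaR of a discrete distribution given as (value, prob) pairs:
    the average of the worst alpha-fraction of outcomes."""
    remaining = alpha
    total = 0.0
    for value, prob in sorted(outcomes):
        take = min(prob, remaining)
        total += take * value
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha

# Hypothetical numbers: first-step reward, then a choice in the same state.
R0 = [(0.0, 0.5), (10.0, 0.5)]
SECOND = {"safe": [(1.0, 1.0)], "risky": [(0.0, 0.5), (4.0, 0.5)]}
ALPHA = 0.75

def return_distribution(choice_after):
    """Distribution of the total return when the second action is chosen
    as a function of the observed first reward."""
    dist = []
    for r0, p0 in R0:
        for r1, p1 in SECOND[choice_after[r0]]:
            dist.append((r0 + r1, p0 * p1))
    return dist

# Enumerate all deterministic history-dependent policies.
results = {}
for low, high in product(SECOND, repeat=2):
    policy = {0.0: low, 10.0: high}
    results[(low, high)] = cvar(return_distribution(policy), ALPHA)

# Markov policies take the same action whatever the first reward was.
markov_best = max(results[(a, a)] for a in SECOND)
history_best = max(results.values())

for policy, value in sorted(results.items()):
    print(policy, round(value, 3))
print("best Markov:", round(markov_best, 3), "best history-dependent:", round(history_best, 3))
```

Under these assumed numbers, the best deterministic Markov policy reaches a CVaR of about 4.67, while taking the risky action after a low first reward and the safe action after a high one reaches 5.0. This illustrates only the general time-inconsistency of CVaR, not the paper's specific counterexample.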

Abstract

Because deploying learned policies requires managing risk, risk-sensitive reinforcement learning (RSRL) has been recognized as an important direction. RSRL is usually achieved by learning risk-sensitive objectives characterized by various risk measures, under the framework of distributional reinforcement learning. However, it remains unclear whether the distributional Bellman operator properly optimizes the RSRL objective in the sense of risk measures. In this paper, we prove that existing RSRL methods do not achieve unbiased optimization and cannot guarantee optimality, or even improvement, with respect to risk measures over accumulated return distributions. To remedy this issue, we propose a novel algorithm, Trajectory Q-Learning (TQL), for RSRL problems with provable policy improvement towards the optimal policy. Based on our new learning architecture, we introduce a general and practical implementation that supports different risk measures and learns correspondingly different risk-sensitive policies. In the experiments, we verify the learnability of our algorithm and show that our method achieves better performance on risk-sensitive objectives.

Paper Structure

This paper contains 38 sections, 10 theorems, 62 equations, 4 figures, 5 tables, 2 algorithms.

Key Result

Lemma 3.1

Recursively applying the distributional Bellman optimality operator, $Z_{k+1}=\mathcal{T}^*Z_{k}$, to an arbitrary value distribution $Z_0$ solves the objective in Eq. (eqn:risk-sensitive-policy) when $\beta$ is exactly the mean, where the optimal policy is obtained via Eq. (eqn:conventional-mean-policy); in particular, $\mathbb{E}[Z_k]\rightarrow \mathbb{E}[Z^*]$ exponentially quickly.
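The "exponentially quickly" part of the lemma can be illustrated numerically. The sketch below tracks only the expectations $\mathbb{E}[Z_k]$, for which the mean-greedy operator reduces to ordinary value iteration on Q-values. The random MDP, its size, the discount `gamma = 0.9`, and all function names are assumptions chosen for illustration and do not come from the paper.

```python
import random

def bellman_optimality(Q, P, R, gamma):
    """One application of the expected-value Bellman optimality operator."""
    n_states, n_actions = len(Q), len(Q[0])
    V = [max(Q[s]) for s in range(n_states)]  # greedy value of each state
    return [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)]

def sup_dist(Q1, Q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))

def random_mdp(n_states, n_actions, seed):
    """Hypothetical MDP with random transition rows and rewards in [-1, 1]."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            P_s.append([x / total for x in w])
            R_s.append(rng.uniform(-1.0, 1.0))
        P.append(P_s)
        R.append(R_s)
    return P, R

gamma = 0.9
P, R = random_mdp(4, 3, seed=0)

# Approximate the fixed point by iterating far past numerical convergence.
Q_star = [[0.0] * 3 for _ in range(4)]
for _ in range(2000):
    Q_star = bellman_optimality(Q_star, P, R, gamma)

# Start from an arbitrary table and record the error before each update.
Q = [[10.0] * 3 for _ in range(4)]
errors = []
for _ in range(30):
    errors.append(sup_dist(Q, Q_star))
    Q = bellman_optimality(Q, P, R, gamma)

print([round(e, 4) for e in errors[:5]])
```

Each recorded error is at most `gamma` times the previous one, which is the geometric rate the lemma refers to. The sketch says nothing about the risk-sensitive case $\beta \neq$ mean, where the paper argues this guarantee fails.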

Figures (4)

  • Figure 1: Undiscounted 3-state MDP for which the optimality operator $\mathcal{T}_\beta^*$ does not converge and yields a non-optimal result. Incorrectly updated entries are highlighted.
  • Figure 2: Mini-grid experiments designed for learning the CVaR objective. (a) Illustration of the risky mini-grid environment. The agent starts at the upper-left corner of the grid (red triangle), and the episode ends when it reaches the green cell at the bottom right. At each timestep, the agent receives a constant penalty of $-2$. The yellow cells give a $+100$ bonus with probability $0.75$ and $0$ with probability $0.25$, while the blue cells always give a reward of $+20$. Each yellow or blue cell gives its reward only once. The orange cells carry a heavy penalty of $-100$ to deter the agent from entering them. (b-c) Experimental results on the task: (b) Vanilla IQN quickly converges to a sub-optimal solution; CVaR-DRL discovers a slightly better policy; TQL finds the optimal policy. (c) The return distributions of vanilla IQN and CVaR-DRL are more conservative, while that of TQL yields a higher CVaR.
  • Figure 3: Predicted return distributions for different inputs: state $s$ or history $h$, and action $a$. The left 4 panels correspond to IQN: IQN first learns $Z(s_1,\cdot)$, see (c-d). It finds $a_1=1$ better and keeps this strategy when learning $Z(s_0,\cdot)$, leading to (a) and (b). The right 6 panels correspond to our proposed method TQL: (e) matches (f), as taking $a_1=0$ has better CVaR after taking $a_0=0$; (h) matches (j), as taking $a_1=1$ has better CVaR after taking $a_0=1$. Overall, the policy corresponds to (e) and (f), which achieve the global optimum.
  • Figure 4: Learning curves on the modified Mountain-Car environment with different risk measures as the objective; performance is reported in terms of the corresponding risk measure. TQL shows a significant performance advantage over the IQTD3 baseline when exposed to larger risks ($c\in\{0.5, 0.75, 1.0\}$), and performs comparably to the baselines under smaller risks ($c\in\{0.25, 0.1, 0.0\}$).
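The reward design described for Figure 2 can be checked with a little arithmetic. The sketch below compares a single yellow cell with a single blue cell under the mean and under a lower-tail CVaR. The `cvar` helper, the level `alpha = 0.25`, and the decision to ignore the per-step penalty and path length are illustrative assumptions, not the paper's experimental setup.

```python
def cvar(outcomes, alpha):
    """Lower-tail CVaR of a discrete distribution given as (value, prob) pairs:
    the average of the worst alpha-fraction of outcomes."""
    remaining = alpha
    total = 0.0
    for value, prob in sorted(outcomes):
        take = min(prob, remaining)
        total += take * value
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha

def mean(outcomes):
    """Expected value of a discrete distribution."""
    return sum(value * prob for value, prob in outcomes)

# Reward distributions taken from the Figure 2 caption.
yellow = [(100.0, 0.75), (0.0, 0.25)]
blue = [(20.0, 1.0)]

alpha = 0.25  # assumed risk level for illustration
print("mean:", mean(yellow), mean(blue))
print("CVaR:", cvar(yellow, alpha), cvar(blue, alpha))
```

A risk-neutral agent prefers the yellow cell (expected reward 75 versus 20), whereas at the assumed level the worst quarter of yellow outcomes is all zeros, so the CVaR objective prefers the blue cell (20 versus 0). With `alpha = 1.0` the CVaR coincides with the mean.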

Theorems & Definitions (16)

  • Lemma 3.1: Value iteration theorem (bellemare2017c51)
  • Theorem 3.2
  • Theorem 4.1: Policy Evaluation for $\mathcal{T}^\pi_h$
  • Theorem 4.2: Policy Improvement for $\mathcal{T}^*_{h,\beta}$
  • Theorem 4.3
  • Lemma C.1: Value iteration theorem
  • proof
  • Theorem C.2
  • proof
  • Theorem C.3: Policy Evaluation for $\mathcal{T}^\pi_h$
  • ...and 6 more