
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Hao Liang, Zhi-Quan Luo

TL;DR

This work advances risk-sensitive reinforcement learning by casting decision-making under the entropic risk measure (EntRM) in a distributional dynamic-programming framework, which enables regret guarantees in finite episodic MDPs. It introduces risk-sensitive distributional dynamic programming (RS-DDP) and two DRL algorithms, the model-free RODI-MF and the model-based RODI-MB, plus distribution-representation variants based on Bernoulli projections (RODI-OTP, RODI-PTO) that preserve EntRM values while keeping updates tractable. The authors establish a near-optimal regret bound $\tilde{\mathcal{O}}\left( \frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK} \right)$ and a tighter minimax lower bound $\Omega\left( \frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT} \right)$ for $\beta>0$, connecting DRL practice to risk sensitivity and showing computationally efficient ways to exploit distributional information. They also prove an equivalence between the distributional, model-based route and the non-distributional ROVI, and provide theoretical and empirical comparisons highlighting the benefits of distributional optimism over traditional bonus-based schemes. Overall, the paper delivers the first regret analysis bridging DRL with risk-sensitive objectives in finite-horizon MDPs and offers practical algorithms whose computation scales via distribution representation.
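To make the "Bernoulli projection that preserves EntRM values" idea concrete, here is a minimal sketch. It assumes the standard definition $U_\beta(X)=\frac{1}{\beta}\log\mathbb{E}[e^{\beta X}]$ and a return supported on a known interval $[\mathrm{lo},\mathrm{hi}]$. The function names `entrm` and `bernoulli_projection` are hypothetical, and the construction shown (matching the exponential moment with a two-point law on the interval endpoints) is one way to obtain such a projection; the paper's exact operator may differ.

```python
import math


def entrm(values, probs, beta):
    """Entropic risk measure (1/beta) * log E[exp(beta * X)] of a discrete X."""
    if beta == 0:
        # Risk-neutral limit: the plain expectation.
        return sum(p * v for v, p in zip(values, probs))
    # Log-sum-exp shift for numerical stability.
    m = max(beta * v for v in values)
    s = sum(p * math.exp(beta * v - m) for v, p in zip(values, probs))
    return (m + math.log(s)) / beta


def bernoulli_projection(values, probs, beta, lo, hi):
    """Mass q on `hi` (and 1 - q on `lo`) giving the same EntRM as X.

    Assumes beta != 0, lo < hi, and every value lies in [lo, hi].
    """
    mgf = sum(p * math.exp(beta * v) for v, p in zip(values, probs))
    return (mgf - math.exp(beta * lo)) / (math.exp(beta * hi) - math.exp(beta * lo))


# Illustrative three-atom return distribution (made-up numbers).
values, probs, beta = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3], 0.5
q = bernoulli_projection(values, probs, beta, lo=0.0, hi=2.0)
print(entrm(values, probs, beta), entrm([0.0, 2.0], [1 - q, q], beta))
```

The two printed numbers agree up to floating-point error: the three-atom distribution is replaced by a single parameter `q` without changing its EntRM, which is why a representation of this kind can keep updates cheap.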

Abstract

We study regret guarantees for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of the return. By leveraging a key property of the EntRM, the independence property, we establish a risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, one model-free and one model-based. We prove that both attain a regret upper bound of $\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\right)$, where $S$, $A$, $K$, and $H$ denote the number of states, actions, and episodes, and the time horizon, respectively. This matches the bound of RSVI2 proposed in \cite{fei2021exponential}, but is obtained through a novel distributional analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity. Acknowledging the computational inefficiency of the model-free DRL algorithm, we propose an alternative DRL algorithm with distribution representation. This approach maintains the established regret bounds while significantly improving computational efficiency. We also prove a tighter minimax lower bound of $\Omega\left(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT}\right)$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
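The risk-neutral claim can be checked directly from the stated bounds. A short sketch, using only the first-order expansion $e^{x}=1+x+O(x^{2})$ as $x\to 0$:

$$\lim_{\beta\to 0^{+}}\frac{\exp(\beta H/6)-1}{\beta H}=\frac{1}{6}
\quad\Longrightarrow\quad
\Omega\left(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT}\right)\;\longrightarrow\;\Omega\left(H\sqrt{SAT}\right),$$

$$\lim_{\beta\to 0}\frac{\exp(|\beta| H)-1}{|\beta|}=H
\quad\Longrightarrow\quad
\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\right)\;\longrightarrow\;\tilde{\mathcal{O}}\left(H^{2}\sqrt{S^2AK}\right).$$

For fixed $\beta\neq 0$, both prefactors grow exponentially in $|\beta|H$, so the risk-sensitivity parameter and the horizon jointly control how much harder the problem is than its risk-neutral counterpart.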

Paper Structure

This paper contains 48 sections, 24 theorems, 216 equations, 1 figure, 1 table, 3 algorithms.

Key Result

Proposition 1

Let $\pi^{*}=\pi^*_{1:H}$ be an optimal policy. Then, for any fixed $h\in[H]$, the truncated policy $\pi^*_{h:H}$ is optimal for the sub-problem $\max_{\pi_{h:H}\in\Pi_{h:H}}V^{\pi}_h$.
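The principle of optimality can be illustrated numerically on a toy problem. The sketch below is an assumption-laden illustration, not the paper's algorithm: it uses a made-up two-state, two-action MDP with deterministic rewards, restricts attention to deterministic Markov policies, and evaluates them with the EntRM backup $V_h(s)=r(s,a)+\frac{1}{\beta}\log\sum_{s'}P(s'\mid s,a)\,e^{\beta V_{h+1}(s')}$. It then checks by brute force that the policy found by backward induction is not beaten at any step $h$ and state $s$. All names (`backup`, `evaluate`, `solve`, `all_policies`) are hypothetical.

```python
import itertools
import math

S, A, H, BETA = 2, 2, 3, 0.8  # toy sizes and risk parameter (illustrative)

# Hypothetical MDP: R[s][a] is a deterministic reward,
# P[s][a][s2] is the probability of moving from s to s2 under action a.
R = [[0.0, 0.5], [1.0, 0.2]]
P = [[[0.9, 0.1], [0.3, 0.7]],
     [[0.5, 0.5], [0.1, 0.9]]]


def backup(s, a, v_next):
    """One EntRM backup: r + (1/beta) * log sum_s2 P(s2) * exp(beta * v_next[s2])."""
    mgf = sum(P[s][a][s2] * math.exp(BETA * v_next[s2]) for s2 in range(S))
    return R[s][a] + math.log(mgf) / BETA


def evaluate(policy):
    """Values V[h][s] of a deterministic Markov policy (steps are 0-indexed here)."""
    V = [[0.0] * S for _ in range(H + 1)]
    for h in reversed(range(H)):
        for s in range(S):
            V[h][s] = backup(s, policy[h][s], V[h + 1])
    return V


def solve():
    """Backward induction: act greedily with respect to the EntRM backup."""
    V = [[0.0] * S for _ in range(H + 1)]
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            q = [backup(s, a, V[h + 1]) for a in range(A)]
            policy[h][s] = max(range(A), key=q.__getitem__)
            V[h][s] = q[policy[h][s]]
    return policy, V


def all_policies():
    """Enumerate every deterministic Markov policy as policy[h][s] -> action."""
    for flat in itertools.product(range(A), repeat=S * H):
        yield [list(flat[h * S:(h + 1) * S]) for h in range(H)]


opt_policy, opt_V = solve()
# Largest advantage any policy has over the backward-induction policy,
# taken over every sub-problem (step h, state s).
gap = max(evaluate(pi)[h][s] - opt_V[h][s]
          for pi in all_policies() for h in range(H) for s in range(S))
print(opt_policy, gap)
```

A non-positive `gap` (up to floating-point tolerance) means the same policy, truncated to steps $h,\dots,H$, remains optimal for each sub-problem in this toy instance. The brute-force check is only feasible because the example has $A^{SH}=64$ policies.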

Figures (1)

  • Figure 1: Comparison of regret for different algorithms.

Theorems & Definitions (36)

  • Proposition 1: Principle of optimality
  • Proposition 2: Distributional Bellman optimality equation
  • Definition 3: Tower property
  • Proposition 4: Equivalence between EntRM and EU
  • Lemma 5: Lipschitz property of EU
  • Definition 6
  • Lemma 7
  • Proposition 8: Optimism
  • Lemma 9: High probability good event
  • Lemma 10
  • ...and 26 more