DRL-ORA: Distributional Reinforcement Learning with Online Risk Adaption

Yupeng Wu, Wenyun Li, Wenjie Huang, Chin Pang Ho

TL;DR

This work proposes a new framework, Distributional RL with Online Risk Adaptation (DRL-ORA), which quantifies both epistemic and implicit aleatory uncertainties in a unified manner and dynamically adjusts the epistemic risk levels by solving a total variation minimization problem online.

Abstract

One of the main challenges in reinforcement learning (RL) is that the agent must make decisions that influence future performance without complete knowledge of the environment. Dynamically adjusting the level of epistemic risk during learning can help achieve reliable policies in safety-critical settings more efficiently. In this work, we propose a new framework, Distributional RL with Online Risk Adaptation (DRL-ORA). This framework quantifies both epistemic and implicit aleatory uncertainties in a unified manner and dynamically adjusts the epistemic risk levels by solving a total variation minimization problem online. The framework unifies the existing variants of risk adaptation approaches and offers better explainability and flexibility. Risk levels are selected efficiently via a grid search using a Follow-The-Leader-type algorithm, where the offline oracle also corresponds to a "satisficing measure" under a specially modified loss function. We show that DRL-ORA outperforms existing methods that rely on fixed risk levels or manually designed risk level adaptation in multiple classes of tasks.
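The grid-search, Follow-The-Leader-type selection mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the per-round loss `loss_fn`, the grid of candidate risk levels, and the synthetic observations are all assumptions, and the paper's total variation objective is not reproduced here.

```python
import numpy as np

def follow_the_leader(loss_fn, grid, observations):
    """Run Follow-The-Leader over a finite grid of candidate risk levels.

    loss_fn(alpha, obs) -> float is a hypothetical per-round loss; the
    paper's actual loss (a total variation objective) is not used here.
    """
    grid = np.asarray(grid, dtype=float)
    cumulative = np.zeros(len(grid))  # cumulative loss of each candidate
    choices = []
    for obs in observations:
        # Play the candidate with the smallest cumulative loss so far
        # (ties resolve to the lowest index, i.e. the smallest alpha).
        choices.append(float(grid[int(np.argmin(cumulative))]))
        # Then observe the round's loss and update every candidate.
        cumulative += np.array([loss_fn(a, obs) for a in grid])
    return choices, cumulative

# Illustrative run: absolute-error loss against synthetic targets.
grid = np.linspace(0.1, 0.9, 9)
rng = np.random.default_rng(0)
targets = rng.uniform(0.4, 0.6, size=50)
choices, cumulative = follow_the_leader(lambda a, y: abs(a - y), grid, targets)
```

Because the candidate set is finite, each round costs one loss evaluation per grid point, which is what makes a grid search practical inside a training loop.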

Paper Structure (20 sections, 8 theorems, 46 equations, 8 figures, 5 tables, 2 algorithms)

Key Result

Theorem 4

For an arbitrarily small $\epsilon>0$, the set $\mathcal{A}$ can be properly discretized as $\mathcal{A}^{\prime}$ such that the Hausdorff distance between the two sets is controlled by $\epsilon$. In addition, by choosing $\epsilon = O(T^{-1/2})$, Algorithm 1 achieves $O(T^{1/2})$ expected regret.
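To make the discretization idea concrete, the sketch below builds a uniform grid on an interval and numerically checks its Hausdorff distance to that interval. The excerpt does not specify $\mathcal{A}$ or its metric, so treating it as the interval $[0, 1]$ with the absolute-value distance is purely an assumption, as are the helper names `uniform_grid` and `hausdorff_to_interval`.

```python
import math
import numpy as np

def uniform_grid(a, b, eps):
    """Uniform grid on [a, b] with spacing at most 2 * eps, so every
    point of the interval lies within eps of some grid point."""
    n_cells = max(1, math.ceil((b - a) / (2.0 * eps)))
    return np.linspace(a, b, n_cells + 1)

def hausdorff_to_interval(grid, a, b, n_probe=20001):
    """Numerically estimate the Hausdorff distance between [a, b] and a
    grid contained in it, by probing the interval densely."""
    probe = np.linspace(a, b, n_probe)
    # The grid is a subset of the interval, so only the
    # interval-to-grid direction can be nonzero.
    dists = np.abs(probe[:, None] - grid[None, :]).min(axis=1)
    return float(dists.max())

T = 10_000
eps = T ** -0.5  # eps = O(T^{-1/2}), as in the theorem statement
grid = uniform_grid(0.0, 1.0, eps)  # assumed stand-in for the set A
d_h = hausdorff_to_interval(grid, 0.0, 1.0)
```

Under this assumption the grid has on the order of $T^{1/2}$ points, so a finer discretization trades a smaller approximation error for a larger candidate set.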

Figures (8)

  • Figure 1: "IQN alpha:1" means that $\alpha$ is fixed at $0.1$ throughout all episodes. "IQN alpha:191" means that $\alpha$ is manually adjusted over the episodes, linearly increasing from $0.1$ to $0.9$ and then linearly decreasing back to $0.1$. The same interpretation applies to the other settings.
  • Figure 2: Average episodic scores in Nano Drone navigation task. The shaded area represents a 90% confidence interval.
  • Figure 3: Testing results with 90% confidence interval. "Comp." means Composite IQN. The "Optimal episodic reward" is the benchmark solved via DP.
  • Figure 4: Reward lines on Knapsack.
  • Figure 5: Graphic illustration of Problem (\ref{Transmin})
  • ...and 3 more figures
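The manually adjusted schedule described in the Figure 1 caption (rising linearly from $0.1$ to $0.9$, then falling linearly back to $0.1$) can be sketched as below. The caption does not say where the turning point lies, so placing the peak at the midpoint of training is an assumption, and `triangular_alpha` is a hypothetical helper name.

```python
def triangular_alpha(episode, n_episodes, low=0.1, high=0.9):
    """Risk level for a given episode: linear ramp low -> high -> low.

    Assumes the peak sits at the midpoint of training; the figure
    caption does not state where the turning point is.
    """
    if n_episodes < 2:
        return low
    # Position in [0, 1] across training.
    t = episode / (n_episodes - 1)
    # Triangle wave: 0 at both ends, 1 at the midpoint.
    tri = 1.0 - abs(2.0 * t - 1.0)
    return low + (high - low) * tri

# Schedule over 101 episodes: starts at low, peaks at episode 50.
schedule = [triangular_alpha(e, 101) for e in range(101)]
```

A fixed-level baseline such as "IQN alpha:1" corresponds to setting `low` and `high` to the same value.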

Theorems & Definitions (16)

  • Example 1
  • Example 2: acerbi2002coherence
  • Example 3: dhaene2012remarks
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Lemma 7
  • proof
  • Lemma 8
  • proof
  • ...and 6 more