
Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu

TL;DR

The paper investigates safety in deep reinforcement learning when both transition and observation uncertainties are present. It introduces Value Function Range (VFR) to quantify robustness and adopts CVaR-based constraints to avoid overly pessimistic policies, culminating in the CPPO algorithm. CPPO solves a CVaR-constrained, PPO-based optimization with an adaptive risk threshold, balancing performance and safety. Empirical results on MuJoCo continuous-control tasks show CPPO achieves competitive rewards and superior robustness to perturbations compared with strong on-policy baselines.

Abstract

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
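
The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes the common loss-side convention, in which risk is measured on the negative return, VaR at level alpha is the smallest value whose empirical CDF reaches alpha, and CVaR is the mean of the samples at or above that VaR. VFR is taken directly from the abstract's description as the gap between the best and worst state values. The function names `empirical_var_cvar` and `value_function_range` are hypothetical and are not taken from the paper or its code.

```python
import math


def empirical_var_cvar(losses, alpha):
    """Empirical VaR and CVaR of a sample of losses (assumed: loss = -return).

    VaR is the smallest sample z whose empirical CDF satisfies F(z) >= alpha.
    CVaR is the mean of all samples at or above that VaR.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(float(x) for x in losses)
    if not ordered:
        raise ValueError("losses must be non-empty")
    # Index of the alpha-quantile in the sorted sample (0-based).
    k = max(math.ceil(alpha * len(ordered)) - 1, 0)
    var = ordered[k]
    tail = [x for x in ordered if x >= var]
    return var, sum(tail) / len(tail)


def value_function_range(state_values):
    """Gap between the best and the worst state value (illustrative VFR)."""
    values = list(state_values)
    return max(values) - min(values)


# Episode returns from a hypothetical rollout; losses are their negatives.
returns = [-4.0, -1.0, -3.0, -2.0]
var, cvar = empirical_var_cvar([-r for r in returns], alpha=0.5)
vfr = value_function_range([0.5, 2.0, -1.0])
```

Under this convention, a constraint of the kind the abstract describes would require `cvar` to stay below a chosen threshold while the expected return is maximized.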

Paper Structure

This paper contains 26 sections, 5 theorems, 55 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

For any policy $\pi$ in MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ and any disturbed environment $\hat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}, \mathcal{R}, \gamma)$, the reduction of the cumulative reward against the transition disturbance is … Furthermore, an upper bound of the reduction is …

Figures (4)

  • Figure 1: Cumulative reward curves for VPG, TRPO, PPO, PG-CMDP and our CPPO. The x-axes indicate the number of steps interacting with the environment, and the y-axes indicate the performance of the agent, including average rewards with standard deviations.
  • Figure 2: Cumulative reward curves for VPG, TRPO, PPO, PG-CMDP and our CPPO under transition disturbance. The x-axes indicate the mass of the agent, and the y-axes indicate the average performance of the algorithm when the mass changes.
  • Figure 3: Cumulative reward curves for VPG, TRPO, PPO, PG-CMDP and our CPPO under observation disturbance. The x-axes indicate the range of the disturbance, and the y-axes indicate the average performance of the algorithm under the state disturbance.
  • Figure 4: Cumulative reward curves for VPG, TRPO, PPO, PG-CMDP and our CPPO under adversarial observation noise. The x-axes indicate the range of the disturbance, and the y-axes indicate the average performance of the algorithm under adversarial noise in the state observations.

Theorems & Definitions (12)

  • Definition 1: VaR and CVaR
  • Definition 2: Value Function Range
  • Theorem 1
  • Theorem 2
  • Theorem 3 (proof in the appendix)
  • Theorem 4 (proof in the appendix)
  • Lemma 1
  • proof
  • proof
  • proof
  • ...and 2 more