State-wise Constrained Policy Optimization

Weiye Zhao, Rui Chen, Yifan Sun, Tianhao Wei, Changliu Liu

TL;DR

This paper introduces State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning, and proves that the worst-case safety violation is bounded under SCPO.

Abstract

Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.

Paper Structure

This paper contains 48 sections, 9 theorems, 53 equations, 17 figures, 8 tables, and 1 algorithm.

Key Result

Theorem 1

For any policies $\pi', \pi$, define $\epsilon^{\pi'}_D \doteq \max_{\hat{s}} \left|\mathbb{E}_{a\sim\pi'}[A^{\pi}_D(\hat{s},a)]\right|$, and let $\bar{d}^\pi = \sum_{t=0}^H P(\hat{s}_t=\hat{s} \mid \pi)$ be the non-discounted augmented state distribution under $\pi$. Then the following bound holds:

Figures (17)

  • Figure 1: Intuition of the maximum state-wise cost: The three figures above illustrate the evolution of the maximum state-wise cost, denoted as $M$ (shown by the red line), across a single episode. The orange curve represents the state-wise cost, while the green lines with arrows labeled as $D$ indicate the increments of $M$ at each step. Steps with $D = 0$ are not labeled in the figures.
  • Figure 2: Comparison of results from two representative test suites in high-dimensional systems (Ant and Walker).
  • Figure 3: Robots for benchmark problems in upgraded Safety Gym.
  • Figure 4: Constraints for benchmark problems in upgraded Safety Gym.
  • Figure 5: Comparison of results from (i) four representative test suites in low-dimensional systems (Point, Swimmer, Drone), (ii) Arm reaching, and (iii) Humanoid locomotion.
  • ...and 12 more figures
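
The Figure 1 caption describes a running maximum cost $M$ that grows by an increment $D$ only when the current state-wise cost exceeds it. The Python sketch below illustrates that bookkeeping for a single episode. It is not the authors' implementation: the function name `max_cost_trajectory`, the starting value `m0 = 0`, and the increment rule `D = max(cost - M, 0)` are assumptions inferred from the caption, and the cost values are made up for illustration.

```python
from typing import List, Tuple


def max_cost_trajectory(costs: List[float], m0: float = 0.0) -> Tuple[List[float], List[float]]:
    """Return (M, D) for one episode.

    M[t] is the maximum state-wise cost seen up to and including step t.
    D[t] is the increment of M at step t (zero when the cost does not
    exceed the running maximum). Both the rule and m0 = 0 are assumptions.
    """
    running_max = m0
    maxima: List[float] = []
    increments: List[float] = []
    for cost in costs:
        # The increment is positive only when the cost exceeds the running maximum.
        increment = max(cost - running_max, 0.0)
        running_max = max(running_max, cost)
        maxima.append(running_max)
        increments.append(increment)
    return maxima, increments


# Hypothetical per-step costs for a five-step episode.
episode_costs = [0.25, 0.5, 0.25, 1.0, 0.75]
M, D = max_cost_trajectory(episode_costs)
print(M)  # [0.25, 0.5, 0.5, 1.0, 1.0]
print(D)  # [0.25, 0.25, 0.0, 0.5, 0.0]
```

Under these assumptions the increments sum to the final value of $M$, so the worst-case state-wise cost of an episode can be read off as a cumulative sum of the $D$ values.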

Theorems & Definitions (16)

  • Theorem 1: Trust Region Update State-wise Maximum Cost Bound
  • Proposition 1: SCPO Update Constraint Satisfaction
  • Proposition 2: SCPO Update Worst Performance Degradation
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • ...and 6 more