State-wise Constrained Policy Optimization

Weiye Zhao; Rui Chen; Yifan Sun; Tianhao Wei; Changliu Liu

TL;DR

This paper introduces State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning, and proves that the worst-case safety violation is bounded under SCPO.

Abstract

Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.

Paper Structure (48 sections, 9 theorems, 53 equations, 17 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Cumulative Safety
State-wise Safety
Hierarchical Policy
End-to-End Policy
Problem Formulation
Preliminaries
State-wise Constrained Markov Decision Process
Maximum Markov Decision Process
State-wise Constrained Policy Optimization
Theoretical Guarantees for SCPO
Practical Implementation
Practical implementation with sample-based estimation
Infeasible constraints
...and 33 more sections

Key Result

Theorem 1

For any policies $\pi', \pi$, define $\epsilon^{\pi'}_D \doteq \max_{\hat{s}} |\mathbb{E}_{a\sim\pi'}[A^{\pi}_D(\hat{s},a)]|$, and let $\bar{d}^\pi = \sum_{t=0}^H P(\hat{s}_t=\hat{s}|\pi)$ be the non-discounted augmented state distribution under $\pi$. Then the following bound holds:

Figures (17)

Figure 1: Intuition of the maximum state-wise cost: The three figures above illustrate the evolution of the maximum state-wise cost, denoted as $M$ (shown by the red line), across a single episode. The orange curve represents the state-wise cost, while the green lines with arrows labeled as $D$ indicate the increments of $M$ at each step. Steps with $D = 0$ are not labeled in the figures.
Figure 2: Comparison of results from two representative test suites in high dimensional systems (Ant and Walker).
Figure 3: Robots for benchmark problems in upgraded Safety Gym.
Figure 4: Constraints for benchmark problems in upgraded Safety Gym.
Figure 5: Comparison of results from (i) four representative test suites in low dimensional systems (Point, Swimmer, Drone), (ii) Arm reaching, and (iii) Humanoid locomotion.
...and 12 more figures
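
The quantities in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that $M$ starts at zero at the beginning of an episode and that the increment at each step is $D = \max(c - M, 0)$, where $c$ is the state-wise cost at that step; both assumptions are inferred from the caption and are not stated in this summary. The function name `max_cost_increments` and the parameter `m0` are hypothetical.

```python
def max_cost_increments(costs, m0=0.0):
    """Track the running maximum state-wise cost M and its increments D.

    Assumption (inferred from the Figure 1 caption): M starts at m0 and
    grows by D = max(c - M, 0) at each step, so M never decreases.
    """
    m = m0
    running_max = []  # value of M after each step
    increments = []   # value of D at each step (0 when the cost does not exceed M)
    for c in costs:
        d = max(c - m, 0.0)
        m += d
        running_max.append(m)
        increments.append(d)
    return running_max, increments


if __name__ == "__main__":
    # Illustrative cost values only; they are not taken from the paper.
    M, D = max_cost_increments([0, 2, 1, 5, 3])
    print("M:", M)
    print("D:", D)
```

Under these assumptions the increments over an episode sum to the final value of $M$, so bounding the accumulated $D$ also bounds the largest state-wise cost seen in that episode.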

Theorems & Definitions (16)

Theorem 1: Trust Region Update State-wise Maximum Cost Bound
Proposition 1: SCPO Update Constraint Satisfaction
Proposition 2: SCPO Update Worst Performance Degradation
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
...and 6 more
