Integrating LTL Constraints into PPO for Safe Reinforcement Learning

Maifang Zhang; Hang Yu; Qian Zuo; Cheng Wang; Vaishak Belle; Fengxiang He

TL;DR

The paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning.

Abstract

This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as the regulations common in robotics, and enable those requirements to be monitored systematically. Violations of LTL constraints are monitored by limit-deterministic Büchi automata and then translated by a logic-to-cost mechanism into penalty signals. These signals guide policy optimization through a Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that PPO-LTL consistently reduces safety violations while maintaining competitive performance compared with state-of-the-art methods. The code is at https://github.com/EVIEHub/PPO-LTL.
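The pipeline in the abstract (monitor, logic-to-cost, Lagrangian penalty) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-state monitor for the safety formula G(!hazard) stands in for a limit-deterministic Büchi automaton, and the names `SafetyMonitor`, `episode_cost`, `update_multiplier`, and `penalized_return`, the unit cost per violation, and the parameters `budget`, `beta`, and `lam_max` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class SafetyMonitor:
    """Hypothetical two-state monitor for the LTL safety formula G(!hazard).

    A simplified stand-in for an automaton-based monitor: it remembers
    whether the formula has ever been violated on the current trace.
    """
    violated: bool = False

    def step(self, labels: Set[str]) -> float:
        """Read the atomic propositions true in this state; return a cost."""
        if "hazard" in labels:
            self.violated = True
            return 1.0  # assumed unit cost per violating step
        return 0.0


def episode_cost(trace: Iterable[Set[str]]) -> float:
    """Logic-to-cost: sum the per-step costs emitted by the monitor."""
    monitor = SafetyMonitor()
    return sum(monitor.step(labels) for labels in trace)


def update_multiplier(lam: float, avg_cost: float, budget: float,
                      beta: float, lam_max: float) -> float:
    """Projected dual ascent: raise lam when cost exceeds the budget."""
    return min(max(lam + beta * (avg_cost - budget), 0.0), lam_max)


def penalized_return(task_return: float, cost: float, lam: float) -> float:
    """Lagrangian objective: task return minus the weighted constraint cost."""
    return task_return - lam * cost


if __name__ == "__main__":
    trace = [set(), {"hazard"}, set(), {"hazard"}]
    cost = episode_cost(trace)
    lam = update_multiplier(0.5, cost, budget=1.0, beta=0.1, lam_max=10.0)
    print(cost, lam, penalized_return(10.0, cost, lam))
```

In this sketch the multiplier grows while the measured cost stays above the budget and is clipped to a bounded interval, so the penalty on violations tightens only when the constraint is actually being broken.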

Paper Structure (11 sections, 1 theorem, 45 equations, 1 figure, 4 tables)

Introduction
Related Work
Preliminaries
PPO-LTL
Constraints as LTL Specifications
Logic-to-Cost Mechanism
The Lagrangian Scheme in PPO-LTL
Theoretical Guarantee
Experiments
Proof
Conclusion

Key Result

Theorem 1

Under Assumptions 1 and 2, with learning rates $0<\alpha\le 1/(4L_{\mathcal{L}})$ and $\beta>0$, let $\{\theta_t,\lambda_t\}_{t\ge0}$ be defined by Eq. (1). Then, for all $T\ge1$, … where $\Delta_\mathcal{L}=\sup_{\Theta\times[0,\Lambda]}\mathcal{L}(\theta,\lambda)-\inf_{\Theta\times[0,\Lambda]}\mathcal{L}(\theta,\lambda)<\infty$, $U_{\max} =\sup_{\theta\in\Theta}\left|J_C(\theta …

Figures (1)

Figure 1: PPO-LTL: environment states are labeled with atomic propositions, monitored by LTL checkers to generate constraint costs, which are integrated with task rewards for policy optimization.

Theorems & Definitions (1)

Theorem 1
