POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints

Jean-Baptiste Bouvier; Kartik Nagpal; Negar Mehr

TL;DR

This paper proposes POLICEd RL, a novel RL algorithm explicitly designed to enforce affine hard constraints in closed-loop with a black-box environment and proves that such policies exist and guarantee constraint satisfaction.

Abstract

In this paper, we seek to learn a robot policy guaranteed to satisfy state constraints. To encourage constraint satisfaction, existing RL algorithms typically rely on Constrained Markov Decision Processes and discourage constraint violations through reward shaping. However, such soft constraints cannot offer verifiable safety guarantees. To address this gap, we propose POLICEd RL, a novel RL algorithm explicitly designed to enforce affine hard constraints in closed loop with a black-box environment. Our key insight is to force the learned policy to be affine around the unsafe set and to use this affine region as a repulsive buffer that prevents trajectories from violating the constraint. We prove that such policies exist and guarantee constraint satisfaction. Our proposed framework is applicable to systems with continuous as well as discrete state and action spaces and is agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of POLICEd RL to enforce hard constraints in robotic tasks while significantly outperforming existing methods.
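
To make the buffer idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes known linear dynamics $\dot s = As + Bu$ (the paper targets black-box environments, which this toy check does not cover) and a policy that is affine inside the buffer, $u = Ds + e$. Under those assumptions $C\dot s$ is affine in $s$, so its maximum over a polytopic buffer is attained at a vertex and checking the vertices is enough. All names (`closed_loop_rate`, `repulsive_on_buffer`, the matrices and the example buffer) are hypothetical.

```python
import numpy as np

def closed_loop_rate(s, A, B, D, e, C):
    """Return C @ s_dot for s_dot = A s + B u with the affine policy u = D s + e."""
    u = D @ s + e
    return float(C @ (A @ s + B @ u))

def repulsive_on_buffer(vertices, A, B, D, e, C, margin=0.0):
    """Check C @ s_dot <= -margin at every buffer vertex.

    For affine closed-loop dynamics this implies the same bound
    everywhere inside the convex hull of the vertices.
    Returns (holds, worst_rate).
    """
    rates = [closed_loop_rate(v, A, B, D, e, C) for v in vertices]
    worst = max(rates)
    return worst <= -margin, worst

# Toy example (assumed): single integrator s_dot = u in the plane,
# constraint y <= 1, buffer = box x in [0, 1], y in [0.8, 1].
A = np.zeros((2, 2))
B = np.eye(2)
C = np.array([0.0, 1.0])
vertices = np.array([[0.0, 0.8], [1.0, 0.8], [1.0, 1.0], [0.0, 1.0]])

# Policy that pushes y downward everywhere in the buffer.
D_safe = np.array([[0.0, 0.0], [0.0, -1.0]])
e_safe = np.array([1.0, 0.5])
ok_safe, worst_safe = repulsive_on_buffer(vertices, A, B, D_safe, e_safe, C)

# Policy that still moves upward at the bottom of the buffer.
e_unsafe = np.array([1.0, 0.9])
ok_unsafe, worst_unsafe = repulsive_on_buffer(vertices, A, B, D_safe, e_unsafe, C)

print(ok_safe, ok_unsafe)
```

The check only inspects a handful of points, which is the practical appeal of keeping the policy affine in the buffer: a finite test stands in for a condition over a whole region.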

Paper Structure (21 sections, 6 theorems, 32 equations, 11 figures, 1 table, 2 algorithms)

Introduction
Related works
Enforcing hard constraints on neural network outputs
Constraints in reinforcement learning
Relative degree of constraints
Black-box safety with control theory
Prior Work
Framework
Constrained Reinforcement Learning
Guaranteed satisfaction of hard constraints
Existence conditions
Implementation
Simulations
Inverted pendulum experiment
Robotic arm
...and 6 more sections

Key Result

Lemma 1

The buffer $\mathcal{B}$ defined in the paper's buffer equation is a polytope.
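
A short sketch of why a polytopic buffer is convenient, under the illustrative assumption (stronger than the paper's black-box setting) that the closed-loop dynamics are affine on $\mathcal{B}$, say $\dot s = Ms + m$ with hypothetical $M$ and $m$. Writing any $s \in \mathcal{B}$ as a convex combination of the vertices $v_1, \ldots, v_N$, with $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$:

$$
s = \sum_{i=1}^{N} \lambda_i v_i
\quad\Longrightarrow\quad
C\dot s = C\Big(M \sum_{i=1}^{N} \lambda_i v_i + m\Big) = \sum_{i=1}^{N} \lambda_i\, C(M v_i + m) \leq \max_{1 \leq i \leq N} C(M v_i + m).
$$

Hence, in this simplified setting, $C(M v_i + m) \leq 0$ at every vertex implies $C\dot s \leq 0$ throughout $\mathcal{B}$.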

Figures (11)

Figure 1: Illustration of closed-loop constrained RL.
Figure 2: Schematic illustration of POLICEd RL. To prevent state $s$ from violating an affine constraint represented by $Cs \leq d$, our POLICEd policy enforces $C\dot s \leq 0$ in buffer region $\mathcal{B}$ (blue) directly below the unsafe area (red). The POLICEd policy (arrows in the environment) is affine inside buffer region $\mathcal{B}$ (delimited by vertices $v_1, \ldots, v_4$), which allows us to easily verify whether trajectories can violate the constraint.
Figure 3: The three categories of constraint satisfaction with increasing guarantees of satisfaction.
Figure 4: Classification task of orange versus purple by a learned decision boundary (red) which is required to be affine inside the black square. POLICE guarantees that the DNN is affine in the region of interest.
Figure 5: State space $\mathcal{S}$ with arrows denoting state transitions under POLICEd policy $\mu_\theta$ for the linear environment of the paper's toy dynamics equation. The affine buffer $\mathcal{B}$ (green) pushes states away from the constraint line (red) before heading towards the target (cyan).
...and 6 more figures

Theorems & Definitions (16)

Lemma 1
proof
Definition 1
Lemma 2
proof
Theorem 1
proof
Remark 1
Corollary 1
proof
...and 6 more
