Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems

Jean-Baptiste Bouvier, Kartik Nagpal, Negar Mehr

TL;DR

This work tackles the problem of enforcing a hard affine safety constraint $y=Cx \le y_{max}$ with relative degree $r \ge 2$ in black-box dynamical systems. It introduces High Relative Degree POLICEd RL, which builds a buffer around the unsafe set in transformed coordinates $s=T(x)$ and learns an affine policy $\mu_\theta(s)=D_\theta s+e_\theta$ on the buffer, coupled with an affine surrogate for the $r$-th derivative and an over-approximation error $\varepsilon$. The central safety result (Theorem 1) shows that if the dissipation condition $\tilde f_r(v;\mu_\theta) \le -2\varepsilon - \beta v_r$ holds at the buffer vertices, trajectories entering the buffer cannot cross the constraint boundary, even though the system is a black box. The approach is validated on an inverted pendulum and a space shuttle landing scenario, where POLICEd trajectories entering the buffer satisfy the constraint and land safely, illustrating the method's potential for provable safety in high-relative-degree control with unknown dynamics. Overall, the paper advances safe RL by enabling hard constraint satisfaction for high relative degree in black-box environments, with theoretical guarantees and practical demonstrations.

Abstract

In this paper, we develop a method for learning a control policy guaranteed to satisfy an affine state constraint of high relative degree in closed loop with a black-box system. Previous reinforcement learning (RL) approaches to satisfy safety constraints either require access to the system model, or assume control affine dynamics, or only discourage violations with reward shaping. Only recently have these issues been addressed with POLICEd RL, which guarantees constraint satisfaction for black-box systems. However, this previous work can only enforce constraints of relative degree 1. To address this gap, we build a novel RL algorithm explicitly designed to enforce an affine state constraint of high relative degree in closed loop with a black-box control system. Our key insight is to make the learned policy be affine around the unsafe set and to use this affine region to dissipate the inertia of the high relative degree constraint. We prove that such policies guarantee constraint satisfaction for deterministic systems while being agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of our approach to enforce hard constraints in the Gym inverted pendulum and on a space shuttle landing simulation.

Paper Structure

This paper contains 12 sections, 6 theorems, 38 equations, 7 figures, and 1 table.

Key Result

Theorem 1

Assume that, for some approximation measure $\varepsilon$, the dissipation condition $\tilde f_r(v;\mu_\theta) \le -2\varepsilon - \beta v_r$ holds for all $v \in \mathcal{V}(\mathcal{B})$, where $v_r$ is the $r$-th component of $v$ and $\beta$ is the constant defined in the paper's equation for $\beta$. If a trajectory $s$ steered by $\mu_\theta$ verifies […] for some $t_0 \geq 0$, and satisfies […] for all $t \in [t_0, t_1)$, then $s_{1:r}(t) < \overline{b}( s(t) )$ for all $t \in [t_0, t_1)$.
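
Checking the dissipation condition $\tilde f_r(v;\mu_\theta) \le -2\varepsilon - \beta v_r$ only at the buffer vertices is enough when both sides are affine in the state: the gap between them is then affine, and an affine function on a polytope attains its maximum at a vertex. The sketch below illustrates this under loudly stated assumptions. The box-shaped buffer, the surrogate coefficients `a` and `c`, and the values of `eps` and `beta` are all hypothetical and are not taken from the paper.

```python
import itertools
import numpy as np


def box_vertices(lower, upper):
    """Return all 2^n corners of the axis-aligned box [lower, upper]."""
    return np.array(list(itertools.product(*zip(lower, upper))))


def dissipation_holds(points, a, c, eps, beta, r):
    """Check a @ v + c <= -2*eps - beta*v_r for every row v of `points`.

    `a @ v + c` is a hypothetical affine closed-loop surrogate of the
    r-th derivative; `r` is 1-indexed, as in the theorem statement.
    """
    lhs = points @ a + c
    rhs = -2.0 * eps - beta * points[:, r - 1]
    return bool(np.all(lhs <= rhs))


# Hypothetical buffer in coordinates (s_1, s_2) = (y, y_dot), relative degree 2.
lower = np.array([0.1, 0.0])
upper = np.array([0.2, 1.0])
V = box_vertices(lower, upper)

# Hypothetical surrogate coefficients and constants (illustrative only).
a = np.array([-5.0, -4.0])
c = 0.0
eps, beta, r = 0.05, 2.0, 2

# Vertex check: only 4 points need to be evaluated.
ok = dissipation_holds(V, a, c, eps, beta, r)

# Sanity check: the same inequality then holds at random interior points,
# because the gap lhs - rhs is affine and is nonpositive at every vertex.
rng = np.random.default_rng(0)
interior = rng.uniform(lower, upper, size=(1000, 2))
interior_ok = dissipation_holds(interior, a, c, eps, beta, r)

print(ok, interior_ok)
```

With these made-up numbers the condition holds at all four vertices, and therefore on the whole box. Replacing `a` with zeros makes the check fail, since the surrogate no longer dissipates anything.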

Figures (7)

  • Figure 1: Phase portrait of constrained output $y$ illustrating our High Relative Degree POLICEd RL method on a system of relative degree $2$. To prevent states from violating constraint $y \leq y_{max}$ (red dashed line), our policy guarantees that trajectories entering buffer region $\mathcal{B}$ (blue) cannot leave it through its upper bound (blue dotted line). Our policy makes $\ddot y$ sufficiently negative in buffer $\mathcal{B}$ to bring $\dot y$ to $0$ in all trajectories entering $\mathcal{B}$. Once $\dot y < 0$, trajectories cannot approach the constraint. Due to the states' inertia, it is physically impossible to prevent all constraint violations. For instance, $y = y_{max}$, $\dot y \gg 1$ will yield $y > y_{max}$ at the next timestep. Hence, we aim only to guarantee the safety of trajectories entering buffer $\mathcal{B}$.
  • Figure 2: The inverted pendulum Gym environment, annotated with cart position $p$, pendulum angle $\theta$, and buffer $\mathcal{B}$.
  • Figure 3: Phase portrait of $(\theta, \dot \theta )$ for the inverted pendulum. None of the POLICEd trajectories (blue) entering buffer $\mathcal{B}$ (green) cross constraint line $\theta = 0.2$ rad (dashed red), whereas some of the baseline trajectories do (dotted orange). Our approach guarantees that a pole arriving at $\theta = 0.1$ rad with a velocity $\dot \theta < 1$ rad/s will satisfy $\theta \leq 0.2$ rad. We do not guarantee the safety of POLICEd trajectories not entering the buffer.
  • Figure 4: Illustration of our Space Shuttle environment. The state $x \in \mathbb{R}^3$ is composed of the altitude or height $h$ of the shuttle, its flight path angle $\gamma$, and its velocity $v$. The control action is the angle of attack $\alpha$.
  • Figure 5: Phase portrait of the space shuttle landing. POLICEd trajectories (blue) entering buffer $\mathcal{B}$ (green) all converge to a set of target conditions (pink) with small vertical velocity from which landing is feasible. However, the baseline trajectories (dotted orange) reach the ground $h = 0$ with high vertical velocities $\dot h \leq -6$ ft/s resulting in a crash of the shuttle (x).
  • ...and 2 more figures
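
The inertia remark in the Figure 1 caption can be made concrete with a double integrator $\ddot y = u$, $|u| \le u_{max}$, used here as a hypothetical stand-in for a relative degree 2 system (it is not the paper's dynamics). Under full braking $u = -u_{max}$, a state with $\dot y > 0$ still overshoots by $\dot y^2 / (2 u_{max})$ before stopping, so no controller can keep it safe unless $y + \dot y^2 / (2 u_{max}) \le y_{max}$. The function names and all numerical values below are illustrative assumptions.

```python
def peak_output(y0, ydot0, u_max):
    """Highest y reached from (y0, ydot0) under full braking u = -u_max.

    Closed form for the double integrator y'' = u: if the state is moving
    toward the constraint (ydot0 > 0), it travels ydot0^2 / (2 * u_max)
    further before its velocity reaches zero.
    """
    if ydot0 <= 0.0:
        return y0
    return y0 + ydot0**2 / (2.0 * u_max)


def can_stay_safe(y0, ydot0, y_max, u_max):
    """True if some admissible control keeps y <= y_max from this state."""
    return peak_output(y0, ydot0, u_max) <= y_max


# Hypothetical constraint and actuation bound.
y_max, u_max = 0.2, 5.0

# On the boundary with a large velocity: violation is unavoidable.
on_boundary_fast = can_stay_safe(0.2, 10.0, y_max, u_max)

# Inside the safe side with a small velocity: braking suffices.
slow_approach = can_stay_safe(0.1, 0.5, y_max, u_max)

print(on_boundary_fast, slow_approach)
```

This is why a guarantee can only be stated for trajectories that enter a buffer with bounded velocity, rather than for every state below $y_{max}$.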

Theorems & Definitions (16)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Proof of Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 6 more