Table of Contents
Fetching ...

Learning Verifiable Control Policies Using Relaxed Verification

Puja Chaudhury, Alexander Estornell, Michael Everett

TL;DR

The paper tackles safety guarantees for learning-based control by integrating differentiable reachability bounds into the training loop, using LiRPA/CROWN to produce tight, runtime-checkable guarantees. By optimizing a composite loss that encodes reach-avoid and invariance properties over a horizon $T$, the method guides neural policies toward verifiable safety specifications. Empirical results on a 2D unicycle and a 6D quadrotor show that policies trained with these relaxed, differentiable bounds can both reach goals and avoid obstacles, while yielding tight bound estimates suitable for lightweight verification at runtime. The approach promises safer deployment of learning-based controllers by combining training-time verification with scalable bound computation. Future work includes extending scalability, more challenging specifications, and integration with imitation learning for demonstrations.

Abstract

To provide safety guarantees for learning-based control systems, recent work has developed formal verification methods to apply after training ends. However, if the trained policy does not meet the specifications, or there is conservatism in the verification algorithm, establishing these guarantees may not be possible. Instead, this work proposes to perform verification throughout training to ultimately aim for policies whose properties can be evaluated throughout runtime with lightweight, relaxed verification algorithms. The approach is to use differentiable reachability analysis and incorporate new components into the loss function. Numerical experiments on a quadrotor model and unicycle model highlight the ability of this approach to lead to learned control policies that satisfy desired reach-avoid and invariance specifications.

Learning Verifiable Control Policies Using Relaxed Verification

TL;DR

The paper tackles safety guarantees for learning-based control by integrating differentiable reachability bounds into the training loop, using LiRPA/CROWN to produce tight, runtime-checkable guarantees. By optimizing a composite loss that encodes reach-avoid and invariance properties over a horizon , the method guides neural policies toward verifiable safety specifications. Empirical results on a 2D unicycle and a 6D quadrotor show that policies trained with these relaxed, differentiable bounds can both reach goals and avoid obstacles, while yielding tight bound estimates suitable for lightweight verification at runtime. The approach promises safer deployment of learning-based controllers by combining training-time verification with scalable bound computation. Future work includes extending scalability, more challenging specifications, and integration with imitation learning for demonstrations.

Abstract

To provide safety guarantees for learning-based control systems, recent work has developed formal verification methods to apply after training ends. However, if the trained policy does not meet the specifications, or there is conservatism in the verification algorithm, establishing these guarantees may not be possible. Instead, this work proposes to perform verification throughout training to ultimately aim for policies whose properties can be evaluated throughout runtime with lightweight, relaxed verification algorithms. The approach is to use differentiable reachability analysis and incorporate new components into the loss function. Numerical experiments on a quadrotor model and unicycle model highlight the ability of this approach to lead to learned control policies that satisfy desired reach-avoid and invariance specifications.

Paper Structure

This paper contains 12 sections, 2 theorems, 15 equations, 5 figures, 1 algorithm.

Key Result

Theorem II.1

Given a function $g: \mathcal{X}\subseteq\mathbb{R}^n\to\mathcal{Y}\subseteq\mathbb{R}^m$, input set $\mathcal{X}'\subseteq\mathcal{X}$, and polytope facets $\mathbf{C}\in\mathcal{R}^{c\times m}$, LiRPA calculates $\mathbf{d}\in\mathbb{R}^c$, which defines polytope outer bounds on the image $g(\math

Figures (5)

  • Figure 1: Learning verifiable control policies. The objective is to synthesize a NN control policy guided by a safety specifications. At each training iteration, loss terms based on reachability analysis are used to update the NN parameters.
  • Figure 2: Comparison of reachable sets computed by each method for a unicycle model and reach-avoid specification. The policy trained with numerical gradients (left) is sensitive to the initial state. The affine policy trained with CROWN loss (middle) reaches the goal and its sampled trajectories avoid the obstacles, but the reachable set bounds collide with an obstacle. The NN policy (right) trained with CROWN loss reaches the goal, and the reachable set bounds are sufficiently tight that they also avoid the obstacles.
  • Figure 3: Effect of Bound Volume Loss, $\mathcal{L}_\text{vol}$. For the unicycle system with a different start/goal from before, the reachable sets calculated by CROWN after training are much tighter when $w_\text{vol}=4$ (left) than $w_\text{vol}=0.05$ (right), even though the system's performance and true reachable sets are nearly identical. This highlights the benefit of including $\mathcal{L}_\text{vol}$ in the training process: it enables obtaining reasonably tight bounds even with a relaxed verification algorithm.
  • Figure 4: Effect of Invariance Loss, $\mathcal{L}_\text{inv}$. Without considering this loss, the reachable sets often continue to grow or go past the goal (e.g., see \ref{['fig:tightness']}). Here, with $w_\text{inv}=100$, the reachable set bounds at the 22nd timestep (bright green set) is a forward invariant set, i.e., the system will never leave this set once it enters. This was confirmed by checking that this set's next reachable set (using CROWN) is a subset.
  • Figure 5: Obstacle avoidance with 6D quadrotor model ($xyz$ positions shown). The trained policy starts in the set with blue samples and reaches the goal (green) in 20 timesteps. Along the way, both the true system trajectories (samples) and reachable set bounds (calculated with CROWN) avoid the obstacles.

Theorems & Definitions (2)

  • Theorem II.1: Linear Relaxation-based Perturbation Analysis (LiRPA)
  • Corollary II.2: zhang2018efficient, Closed-Loop Reachability Analysis with CROWN