Certifying Hamilton-Jacobi Reachability Learned via Reinforcement Learning

Prashant Solanki; Isabelle El-Hajj; Jasper J. van Beers; Erik-Jan van Kampen; Coen C. de Visser

Certifying Hamilton-Jacobi Reachability Learned via Reinforcement Learning

Prashant Solanki, Isabelle El-Hajj, Jasper J. van Beers, Erik-Jan van Kampen, Coen C. de Visser

TL;DR

The approach on a double-integrator system is demonstrated by formally certifying, via satisfiability modulo theories (SMT), a value function learned through reinforcement learning to induce provably correct inner and outer backward-reachable set enclosures over a compact region of interest.

Abstract

We present a framework to \emph{certify} Hamilton--Jacobi (HJ) reachability learned by reinforcement learning (RL). Building on a discounted initial time \emph{travel-cost} formulation that makes small-step RL value iteration provably equivalent to a forward Hamilton--Jacobi (HJ) equation with damping, we convert certified learning errors into calibrated inner/outer enclosures of strict backward reachable tube. The core device is an additive-offset identity: if $W_λ$ solves the discounted travel-cost Hamilton--Jacobi--Bellman (HJB) equation, then $W_\varepsilon:=W_λ+ \varepsilon$ solves the same PDE with a constant offset $λ\varepsilon$. This means that a uniform value error is \emph{exactly} equal to a constant HJB offset. We establish this uniform value error via two routes: (A) a Bellman operator-residual bound, and (B) a HJB PDE-slack bound. Our framework preserves HJ-level safety semantics and is compatible with deep RL. We demonstrate the approach on a double-integrator system by formally certifying, via satisfiability modulo theories (SMT), a value function learned through reinforcement learning to induce provably correct inner and outer backward-reachable set enclosures over a compact region of interest.

Certifying Hamilton-Jacobi Reachability Learned via Reinforcement Learning

TL;DR

Abstract

solves the discounted travel-cost Hamilton--Jacobi--Bellman (HJB) equation, then

solves the same PDE with a constant offset

. This means that a uniform value error is \emph{exactly} equal to a constant HJB offset. We establish this uniform value error via two routes: (A) a Bellman operator-residual bound, and (B) a HJB PDE-slack bound. Our framework preserves HJ-level safety semantics and is compatible with deep RL. We demonstrate the approach on a double-integrator system by formally certifying, via satisfiability modulo theories (SMT), a value function learned through reinforcement learning to induce provably correct inner and outer backward-reachable set enclosures over a compact region of interest.

Paper Structure (35 sections, 7 theorems, 123 equations, 1 figure, 1 table)

This paper contains 35 sections, 7 theorems, 123 equations, 1 figure, 1 table.

Introduction
Problem Setup
Discounted Travel Value
Initial/Forward Time Formulation
Discounted HJB for an Additively Shifted Value
Value Function with Additive Offset
Dynamic Programming Principle Associated With Value Function with Additive Offset
HJB PDE Associated with $V_\varepsilon$
Forward/initial time formulation and corresponding HJB
RL counterpart and $\varepsilon$-residual implications
Discrete-time Bellman Operator
Bellman Operator Value-error and HJB PDE slack Bound Guarantees
SMT-Based Certification of Uniform Value Error on a Region of Interest
Two SMT routes to a certified value bound
Route A: Bellman-operator residual on an invariant ROI
...and 20 more sections

Key Result

Theorem 1

Fix $(t,x)\in[0,T)\times\mathbb{R}^n$ and $\sigma\in(0,T-t]$. For the additively shifted value the following DPP holds:

Figures (1)

Figure 1: PDE numeric solution, NN prediction, and their difference $\Delta=W_{\text{num}}-W_{\text{nn}}$ on the same grid. Errors are small and localized near the bang–bang switching locus and the ROI boundary; the observed magnitude ($\approx 10^{-2}$) is consistent with the SMT-certified residual tolerance $\varepsilon=0.1$.

Theorems & Definitions (19)

Remark 1: Discount and contraction
Theorem 1: Dynamic Programming Principle for the Additively-shifted Value $V_\varepsilon$
proof
Lemma 1
proof
Lemma 2
proof
Theorem 2: Viscosity HJB for $V_\varepsilon$
proof
Lemma 3: Contraction and value error
...and 9 more

Certifying Hamilton-Jacobi Reachability Learned via Reinforcement Learning

TL;DR

Abstract

Certifying Hamilton-Jacobi Reachability Learned via Reinforcement Learning

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (1)

Theorems & Definitions (19)