Data-Driven Synthesis of Probabilistic Controlled Invariant Sets for Linear MDPs

Kazumune Hashimoto, Shunki Kimura, Kazunobu Serizawa, Junya Ikemoto, Yulong Gao, Kai Cai

Abstract

We study data-driven computation of probabilistic controlled invariant sets (PCIS) for safety-critical reinforcement learning under unknown dynamics. Assuming a linear MDP model, we use regularized least squares and self-normalized confidence bounds to construct a conservative estimate of the states from which the system can be kept inside a prescribed safe region over an \(N\)-step horizon, together with the corresponding set-valued safe action map. This construction is obtained through a backward recursion and can be interpreted as a conservative approximation of the \(N\)-step safety predecessor operator. When the associated conservative-inclusion event holds, a conservative fixed point of the approximate recursion can be certified as an \((N,\epsilon)\)-PCIS with confidence at least \(\eta\). For continuous state spaces, we introduce a lattice abstraction and a Lipschitz-based discretization error bound to obtain a tractable approximation scheme. Finally, we use the resulting conservative fixed-point approximation as a runtime candidate PCIS in a practical shielding architecture with iterative updates, and illustrate the approach on a numerical experiment.
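To make the estimation step in the abstract more concrete, here is a minimal sketch of regularized least squares with an ellipsoidal confidence width, used to form a pessimistic (lower) estimate of a one-step safety probability. It is an illustration under stated assumptions, not the paper's algorithm: the function name `ridge_lcb`, the regularizer `lam`, the confidence radius `beta`, and the synthetic data are all hypothetical, and in practice `beta` would come from a self-normalized concentration bound.

```python
import numpy as np

def ridge_lcb(Phi, y, phi_query, lam=1.0, beta=1.0):
    """Ridge estimate and a lower confidence bound at the query features.

    Phi: (n, d) observed features, y: (n,) targets in [0, 1],
    phi_query: (m, d) features to evaluate. `lam` and `beta` are
    illustrative constants (assumptions, not values from the paper).
    """
    d = Phi.shape[1]
    Lambda = lam * np.eye(d) + Phi.T @ Phi           # regularized Gram matrix
    theta_hat = np.linalg.solve(Lambda, Phi.T @ y)   # regularized least squares
    sol = np.linalg.solve(Lambda, phi_query.T)       # Lambda^{-1} phi, shape (d, m)
    width = np.sqrt(np.sum(phi_query.T * sol, axis=0))  # ||phi||_{Lambda^{-1}}
    mean = phi_query @ theta_hat
    # Pessimistic estimate: subtract the confidence width, clip to [0, 1].
    return np.clip(mean - beta * width, 0.0, 1.0), theta_hat

# Synthetic data: y is a Bernoulli "next state is safe" indicator whose
# success probability is linear in the features (hypothetical setup).
rng = np.random.default_rng(0)
theta_true = np.array([0.6, 0.3])
Phi = rng.dirichlet(np.ones(2), size=500)
y = (rng.uniform(size=500) < Phi @ theta_true).astype(float)

queries = np.array([[1.0, 0.0], [0.0, 1.0]])
lcb, theta_hat = ridge_lcb(Phi, y, queries, lam=1.0, beta=1.0)
print("estimate:", theta_hat, "lower bounds:", lcb)
```

A conservative safe-action test would then keep only the actions whose lower bound meets the required safety level, which is the sense in which the resulting set estimate errs on the side of caution.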

Paper Structure

This paper contains 12 sections, 5 theorems, 94 equations, 6 figures, 2 tables, 4 algorithms.

Key Result

Lemma 1

Let the linear MDP assumption hold and let $p:\mathcal{X}\to[0,1]$ be measurable. Then there exists $\theta_p\in\mathbb{R}^d$ such that, for all $(x,u)\in\mathcal{X}\times\mathcal{U}$, $\mathbb{E}[p(x')\mid x,u]=\phi(x,u)^\top\theta_p$, where $x'\sim\mathbb{P}(\cdot\mid x,u)$ and $\phi$ denotes the feature map of the linear MDP. $\blacktriangleleft$
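As a quick sanity check of the kind of linear representation Lemma 1 asserts, the sketch below builds a small finite linear MDP and verifies numerically that the expected value of a bounded function of the next state is linear in the features. It assumes the standard factorization $\mathbb{P}(x'\mid x,u)=\phi(x,u)^\top\mu(x')$. The symbols `phi` and `mu`, the sizes, and the random construction are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 3

# Hypothetical feature map phi(x, u): one row per (state, action) pair,
# each row on the probability simplex so that P below is a valid kernel.
phi = rng.dirichlet(np.ones(d), size=n_states * n_actions)  # shape (10, 3)

# Hypothetical measures mu_1..mu_d: each row is a distribution over next states.
mu = rng.dirichlet(np.ones(n_states), size=d)               # shape (3, 5)

# Linear MDP transition kernel: P[(x, u), x'] = phi(x, u)^T mu(x').
P = phi @ mu

# Any function p with values in [0, 1] on the state space.
p = rng.uniform(0.0, 1.0, size=n_states)

# Candidate parameter: theta_p = sum over x' of mu(x') * p(x').
theta_p = mu @ p

lhs = P @ p            # E[p(x') | x, u] computed from the kernel
rhs = phi @ theta_p    # linear form phi(x, u)^T theta_p
max_gap = float(np.max(np.abs(lhs - rhs)))
print("max |E[p(x')|x,u] - phi^T theta_p| =", max_gap)
```

In this finite setting the identity is just associativity of the matrix products, `(phi @ mu) @ p == phi @ (mu @ p)`, which is why a single vector `theta_p` works for every state-action pair.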

Figures (6)

  • Figure 1: Overview of safe RL via PCIS-based shielding with a grow/certify split.
  • Figure 2: Representative state-space trajectories for unshielded SARSA and shielded SARSA. The bold blue curve denotes the trajectory portion generated in the current update interval, the thin gray curves denote earlier update intervals, the red dashed rectangle indicates the safe set, and the green region indicates the current candidate PCIS when shielding is active.
  • Figure 3: Representative state-space trajectories for unshielded DQN and shielded DQN. The bold blue curve denotes the trajectory portion generated in the current update interval, the thin gray curves denote earlier update intervals, the red dashed rectangle indicates the safe set, and the green region indicates the current candidate PCIS when shielding is active.
  • Figure 4: Mean $\pm$ standard deviation of returns accumulated over update intervals for DQN with and without shielding, over 30 random seeds.
  • Figure 5: Mean $\pm$ standard deviation of returns accumulated over update intervals for SARSA with and without shielding, over 30 random seeds.
  • ...and 1 more figure

Theorems & Definitions (15)

  • Definition 1: $(N,\epsilon)$-PCIS
  • Definition 2: $\eta$-conservative approximation of the operator/candidate PCIS
  • Lemma 1
  • proof
  • Theorem 1: $\eta$-conservative approximation operator guarantee
  • proof
  • Corollary 1: Certification of PCIS by sample splitting
  • proof
  • Theorem 2: Lattice-based conservative operator
  • Corollary 2: Safe actions for a fixed reference set
  • ...and 5 more