Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report

Johannes Schmalz; Chaahat Jain

Abstract

Learned action policies are increasingly popular in sequential decision-making, but suffer from a lack of safety guarantees. Recent work introduced a pipeline for testing the safety of such policies under initial-state and action-outcome non-determinism. At the pipeline's core is the problem of deciding whether a state is safe (a safe policy exists from the state) and of finding faults, which are state-action pairs that transition from a safe state to an unsafe one. The pipeline's most effective algorithm for deciding safety, TarjanSafe, performs well on the original benchmarks, but we show that it has exponential worst-case runtime with respect to the size of the state space. A linear-time alternative exists, but it is slower in practice. We close this gap with a new policy-iteration algorithm, iPI, that combines the best of both: it matches TarjanSafe's best-case runtime while guaranteeing a polynomial worst case. Experiments confirm our theory and show that on problems amenable to TarjanSafe, iPI has similar performance, whereas on ill-suited problems iPI scales exponentially better.
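
To make the notions of safe state and fault concrete, the following is a minimal sketch of one standard linear-time approach: propagating unsafety backwards from a set of states that are unsafe by definition. It is not taken from the report, and it is an assumption that it resembles the linear-time alternative mentioned above. The dictionary encoding of the task, the names `unsafe_states` and `faults`, and the choice to treat states without applicable actions as safe are all hypothetical.

```python
from collections import deque


def unsafe_states(transitions, bad):
    """Return the states from which no safe policy exists.

    transitions: dict state -> dict action -> set of possible successors
                 (hypothetical encoding of a non-deterministic task).
    bad:         set of states that are unsafe by definition.
    Assumption:  a state with no applicable actions is safe unless it is in bad.
    """
    # preds[t] lists the (state, action) pairs that may lead to t.
    preds = {}
    for s, acts in transitions.items():
        for a, outs in acts.items():
            for t in outs:
                preds.setdefault(t, []).append((s, a))

    # remaining[s] counts actions of s not yet known to risk an unsafe outcome.
    remaining = {s: len(acts) for s, acts in transitions.items()}
    risky = set()  # (state, action) pairs with at least one unsafe outcome
    unsafe = set(bad)
    queue = deque(bad)
    while queue:
        t = queue.popleft()
        for s, a in preds.get(t, []):
            if (s, a) in risky:
                continue
            risky.add((s, a))
            remaining[s] -= 1
            # s becomes unsafe once every one of its actions is risky.
            if remaining[s] == 0 and s not in unsafe:
                unsafe.add(s)
                queue.append(s)
    return unsafe


def faults(transitions, unsafe):
    """State-action pairs that may move from a safe state to an unsafe one."""
    return {
        (s, a)
        for s, acts in transitions.items()
        if s not in unsafe
        for a, outs in acts.items()
        if any(t in unsafe for t in outs)
    }


# Toy task: s0 stays safe by choosing action "a"; every action in s2 risks "bad".
transitions = {
    "s0": {"a": {"s1"}, "b": {"s1", "s2"}},
    "s1": {"a": {"s1"}},
    "s2": {"a": {"bad"}, "b": {"bad", "s1"}},
    "bad": {},
}
unsafe = unsafe_states(transitions, {"bad"})
fault_pairs = faults(transitions, unsafe)
```

In this toy task `s2` is unsafe because each of its actions has an outcome in `bad`, while `s0` is safe because action `a` avoids `s2`; the pair `("s0", "b")` is therefore a fault. Each transition is inspected a constant number of times, which is where the linear bound in this sketch comes from.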

Paper Structure (16 sections, 7 theorems, 3 equations, 5 figures, 1 table, 5 algorithms)

Introduction
Background
Algorithms
TarjanSafe
Unsafety Propagation
Policy Iteration (PI)
Experiments
Benchmarks.
Results.
Comparing the variants of iPI:
Comparing iPI to TarjanSafe:
Summary.
Conclusion
Propagating Unsafety with Binary Decision Diagrams
Is the Main Loop in iPI Necessary?
...and 1 more sections

Key Result

Theorem 1

TarjanSafe has a best-case runtime of $\Theta(|\pi_{\text{min-safe}}|)$ (if a safe policy exists).

Figures (5)

Figure 1: Non-deterministic task with 5 layers and $2^5$ paths from $s_0$ to $s_6$. Similarly constructed tasks with $d$ layers have $2^d$ paths from $s_0$ to $s_{d+1}$.
Figure 2: Cumulative plots of the proportion of states decided w.r.t. time. The plots separate safe and unsafe states: (left) is an aggregation over all domains' safe states, (middle) is over all unsafe states, and (right) is over safe states in Flappy Bird.
Figure 3: A state-avoiding task where a single call to rec-iPI finds an unsafe policy but does not recognise it as unsafe.
Figure 4: Cumulative plots of the proportion of states decided w.r.t. time for unsafe states.
Figure 5: Cumulative plots of the proportion of states decided w.r.t. time for safe states.

Theorems & Definitions (14)

Definition 1
Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
Corollary 5
...and 4 more
