Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report

Johannes Schmalz, Chaahat Jain

Abstract

Learned action policies are increasingly popular in sequential decision-making, but suffer from a lack of safety guarantees. Recent work introduced a pipeline for testing the safety of such policies under initial-state and action-outcome non-determinism. At the pipeline's core is the problem of deciding whether a state is safe (a safe policy exists from the state) and finding faults, which are state-action pairs that transition from a safe state to an unsafe one. Their most effective algorithm for deciding safety, TarjanSafe, performs well on their benchmarks, but we show that it has exponential worst-case runtime with respect to the size of the state space. A linear-time alternative exists, but it is slower in practice. We close this gap with a new policy-iteration algorithm, iPI, that combines the best of both: it matches TarjanSafe's best-case runtime while guaranteeing a polynomial worst-case runtime. Experiments confirm our theory and show that iPI performs similarly to TarjanSafe on problems amenable to TarjanSafe, whereas on ill-suited problems iPI scales exponentially better.
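
To make the notions of safe state and fault from the abstract concrete, the following is a minimal sketch under stated assumptions. It is a naive fixed-point computation, not TarjanSafe, iPI, or the linear-time algorithm mentioned above. The names `safe_states`, `faults`, and `bad`, the dictionary encoding of the task, and the choice to treat a state without applicable actions as safe are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding: transitions[s][a] is the set of possible outcome states
# of applying action a in state s. Every state must appear as a key.
Transitions = Dict[str, Dict[str, Set[str]]]


def safe_states(transitions: Transitions, bad: Set[str]) -> Set[str]:
    """Return the states from which some policy never reaches a state in `bad`.

    Naive greatest fixed point: start from all non-bad states and repeatedly
    discard states where every applicable action has an outcome outside the set.
    """
    safe = {s for s in transitions if s not in bad}
    changed = True
    while changed:
        changed = False
        for s in list(safe):
            actions = transitions[s]
            if not actions:
                continue  # assumption: a terminal, non-bad state counts as safe
            if not any(outcomes <= safe for outcomes in actions.values()):
                safe.remove(s)
                changed = True
    return safe


def faults(transitions: Transitions, bad: Set[str]) -> List[Tuple[str, str]]:
    """Return (state, action) pairs that can lead from a safe state to an unsafe one."""
    safe = safe_states(transitions, bad)
    return sorted(
        (s, a)
        for s in safe
        for a, outcomes in transitions[s].items()
        if not outcomes <= safe
    )


# Hypothetical toy task, not taken from the paper.
toy = {
    "s0": {"a": {"s1"}, "b": {"s1", "s2"}},
    "s1": {"stay": {"s1"}},
    "s2": {"c": {"bad", "s1"}},
    "bad": {},
}
print(safe_states(toy, {"bad"}))
print(faults(toy, {"bad"}))
```

In this toy task, `s2` is unsafe because its only action may reach `bad`, so choosing `b` in `s0` is a fault while choosing `a` is not.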

Paper Structure (16 sections, 7 theorems, 3 equations, 5 figures, 1 table, 5 algorithms)

Key Result

Theorem 1

TarjanSafe has a best-case runtime of $\Theta(|\pi_{\text{min-safe}}|)$ (if a safe policy exists).

Figures (5)

  • Figure 1: Non-deterministic task with 5 layers and $2^5$ paths from $s_0$ to $s_6$. Similarly constructed tasks with $d$ layers have $2^d$ paths from $s_0$ to $s_{d+1}$.
  • Figure 2: Cumulative plots of the proportion of states decided w.r.t. time. The plots separate safe and unsafe states: (left) is an aggregation over all domains' safe states, (middle) is over all unsafe states, and (right) is over safe states in Flappy Bird.
  • Figure 3: A state-avoiding task where a single call to rec-iPI finds an unsafe policy but does not recognise it as unsafe.
  • Figure 4: Cumulative plots of the proportion of states decided w.r.t. time for unsafe states.
  • Figure 5: Cumulative plots of the proportion of states decided w.r.t. time for safe states.
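
The Figure 1 caption contrasts a number of paths that grows as $2^d$ with a task whose size grows only with the number of layers. The sketch below illustrates that arithmetic on a hypothetical layered graph in which each layer offers two routes to the next layer. The layout and the node names (`n0`, `t0`, `b0`, ...) are assumptions for illustration and do not reproduce the paper's figure.

```python
from functools import lru_cache
from typing import Dict, List


def build_layered_graph(d: int) -> Dict[str, List[str]]:
    """Build an assumed d-layer graph: each layer splits into two routes that rejoin."""
    succ: Dict[str, List[str]] = {}
    for i in range(d):
        top, bottom = f"t{i}", f"b{i}"
        succ[f"n{i}"] = [top, bottom]   # two alternative routes through layer i
        succ[top] = [f"n{i + 1}"]
        succ[bottom] = [f"n{i + 1}"]
    succ[f"n{d}"] = []                  # final node has no successors
    return succ


def count_paths(succ: Dict[str, List[str]], start: str, goal: str) -> int:
    """Count distinct paths from start to goal in an acyclic graph via memoisation."""
    @lru_cache(maxsize=None)
    def paths_from(node: str) -> int:
        if node == goal:
            return 1
        return sum(paths_from(nxt) for nxt in succ[node])

    return paths_from(start)


graph = build_layered_graph(5)
print(len(graph), count_paths(graph, "n0", "n5"))  # prints "16 32"
```

With 5 layers this assumed graph has 16 nodes but 32 start-to-end paths, so any procedure that explores paths one by one does exponentially more work than one that visits each node once.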

Theorems & Definitions (14)

  • Definition 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Corollary 5
  • ...and 4 more