Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress

Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, Jeannette Bohg

TL;DR

Sentinel is a runtime monitoring framework that splits failure detection into two complementary categories: erratic failures, which the authors detect using statistical measures of temporal action consistency, and task progression failures, for which they use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task.

Abstract

Robot behavior policies trained via imitation learning are prone to failure under conditions that deviate from their training data. Thus, algorithms that monitor learned policies at test time and provide early warnings of failure are necessary to facilitate scalable deployment. We propose Sentinel, a runtime monitoring framework that splits the detection of failures into two complementary categories: 1) Erratic failures, which we detect using statistical measures of temporal action consistency, and 2) task progression failures, where we use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task. Our approach has two key strengths. First, because learned policies exhibit diverse failure modes, combining complementary detectors leads to significantly higher accuracy at failure detection. Second, using a statistical temporal action consistency measure ensures that we quickly detect when multimodal, generative policies exhibit erratic behavior at negligible computational cost. In contrast, we only use VLMs to detect failure modes that are less time-sensitive. We demonstrate our approach in the context of diffusion policies trained on robotic mobile manipulation domains in both simulation and the real world. By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than using either of the two detectors alone and significantly outperforms baselines, thus highlighting the importance of assigning specialized detectors to complementary categories of failure. Qualitative results are made available at https://sites.google.com/stanford.edu/sentinel.

Paper Structure (56 sections, 3 theorems, 8 equations, 9 figures, 7 tables)

Key Result

Proposition 1

Let $P_\tau$ denote the distribution of success trajectories in the validation dataset $\mathcal{D}_\tau = \{\tau^i\}_{i=1}^M \overset{\textup{iid}}{\sim} P_\tau$. Then, the FPR---the probability of raising a false alarm at any point during an i.i.d. test trajectory $\tau \sim P_\tau$ of length $H'

Figures (9)

  • Figure 1: We present Sentinel, a runtime monitor that detects unknown failures of generative robot policies at deployment time. Constructing Sentinel requires only a set of successful policy rollouts and a description of the task, from which it detects diverse failures by monitoring (a) the temporal consistency of action-chunk distributions generated by the policy and (b) the task progress of the robot(s) through video QA with Vision Language Models.
  • Figure 2: Action sequence prediction overlap during policy rollout.
  • Figure 3: Overview of Sentinel. The images depict a policy rollout for timesteps $t=1,\ldots, T$. Temporal Consistency Detector: At each timestep $t$, the state $s_t$ is passed to the generative policy to obtain action distributions $\pi_t$ between which statistical distances $\hat{D}_t$ are computed to measure temporal consistency. The statistical distances are summed up to the current timestep $T$ (as in Eq. \ref{eq:cum-score-fn}) and thresholded by $\gamma$ to detect policy failure. Vision Language Model (VLM) Detector: The VLM classifies whether the policy is failing to make progress on its task given a video up to timestep $T$ and a description of the task. Execution stops if either detector raises a warning.
  • Figure 4: Temporal consistency scores grow faster when the policy fails. Error bars indicate the 5th and 95th score quantiles.
  • Figure 5: Detecting failures in PushT. Left: Our failure detector (STAC), which measures the temporal consistency of a generative policy, outperforms several families of out-of-distribution detectors. Right: The best performance comes from measuring temporal consistency with statistical distance functions; augmenting baselines with temporal consistency does not always increase their performance.
  • ...and 4 more figures
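
The temporal consistency detector described in the Figure 2 and Figure 3 captions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each policy call returns several sampled action chunks, that consecutive chunks overlap on the steps that were predicted but not yet executed, and that the statistical distance is a squared maximum mean discrepancy (MMD) with a Gaussian kernel. The source only says "statistical distances", so the choice of MMD, the bandwidth, and all function names here are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]
Chunk = Sequence[Sequence[float]]  # one sampled action chunk: a list of actions


def rbf(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian kernel between two flattened action sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(xs: List[Vector], ys: List[Vector], bandwidth: float = 1.0) -> float:
    """Biased estimate of the squared MMD between two sample sets."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / (len(xs) * len(xs))
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / (len(ys) * len(ys))
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def flatten(actions: Chunk) -> List[float]:
    return [value for action in actions for value in action]


def consistency_distance(prev_samples: Sequence[Chunk],
                         next_samples: Sequence[Chunk],
                         executed: int,
                         bandwidth: float = 1.0) -> float:
    """Distance between two consecutive chunk distributions on their overlap.

    `executed` actions of the previous chunk were run before the next chunk
    was predicted, so the tail of the previous chunk and the head of the next
    chunk describe the same future timesteps.
    """
    overlap = len(prev_samples[0]) - executed
    xs = [flatten(chunk[executed:]) for chunk in prev_samples]
    ys = [flatten(chunk[:overlap]) for chunk in next_samples]
    return mmd2(xs, ys, bandwidth)


def first_alarm(distances: Sequence[float], gamma: float) -> Optional[int]:
    """Return the first 1-indexed step whose cumulative score exceeds gamma."""
    total = 0.0
    for step, distance in enumerate(distances, start=1):
        total += distance
        if total > gamma:
            return step
    return None


# Toy usage with one-dimensional actions and a single sample per policy call.
previous = [[[0.0], [1.0], [2.0], [3.0]]]
consistent = [[[2.0], [3.0], [4.0], [5.0]]]   # agrees with the previous plan
erratic = [[[9.0], [9.0], [9.0], [9.0]]]      # abandons the previous plan
d_consistent = consistency_distance(previous, consistent, executed=2)
d_erratic = consistency_distance(previous, erratic, executed=2)
```

In this toy case the consistent chunk yields a distance of zero while the erratic chunk yields a distance close to 2, so a rollout containing erratic steps crosses the threshold $\gamma$ sooner.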

Theorems & Definitions (4)

  • Proposition 1: STAC has low FPR
  • Theorem 1: Adapted from Thm. D.1 in Angelopoulos & Bates (2021)
  • Proposition 2: STAC has low FPR
  • proof
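
One way to read the "low FPR" propositions is that the alarm threshold $\gamma$ is calibrated on the scores of successful validation rollouts. The sketch below shows the standard split-conformal quantile rule for that calibration. It is an assumption that Sentinel uses exactly this rule: the listing above only states that the false positive rate is controlled for i.i.d. success trajectories and cites a conformal prediction result. The function names, the target rate `delta`, and the use of one final cumulative score per rollout are all hypothetical.

```python
import math
from typing import Sequence


def calibrate_threshold(success_scores: Sequence[float], delta: float) -> float:
    """Pick gamma as a conformal quantile of scores from successful rollouts.

    With M exchangeable calibration scores, taking the ceil((M + 1)(1 - delta))-th
    smallest score bounds the chance that a new successful rollout exceeds it
    by delta. If that rank is larger than M, no finite threshold suffices.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    m = len(success_scores)
    rank = math.ceil((m + 1) * (1.0 - delta))
    if rank > m:
        return math.inf
    return sorted(success_scores)[rank - 1]


def empirical_fpr(success_scores: Sequence[float], gamma: float) -> float:
    """Fraction of successful rollouts that would trigger a false alarm."""
    alarms = sum(1 for score in success_scores if score > gamma)
    return alarms / len(success_scores)


# Toy usage: three successful rollouts, target false positive rate of 0.5.
scores = [3.0, 1.0, 2.0]
gamma = calibrate_threshold(scores, delta=0.5)
```

Because a cumulative sum of nonnegative distances never decreases, the final cumulative score of a rollout is also its maximum, so thresholding that single number per rollout covers alarms raised at any earlier timestep.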