Failure Prediction at Runtime for Generative Robot Policies

Ralf Römer, Adrian Kobras, Luca Worbis, Angela P. Schoellig

TL;DR

This paper introduces FIPER, a framework calibrated with conformal prediction (CP) for predicting task failures of diffusion- and flow-based imitation learning policies without failure data. It combines two indicators: (i) RND-OE, which detects consecutive out-of-distribution observations via random network distillation in the policy's observation embedding space, and (ii) ACE, an action-chunk entropy score that measures persistently high uncertainty in generated action chunks. Both scores are calibrated on a small set of successful rollouts via CP and aggregated over a sliding time window; a failure alarm is raised only when both exceed their thresholds. Across five simulated and real-world environments with diverse failure modes, FIPER predicts failures earlier and more accurately than existing methods, a step towards safer and more interpretable generative robot policies.
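
The alarm logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `window_mean` and `first_alarm`, the use of a mean as the window aggregate, and the constant thresholds `gamma_o` and `gamma_a` are all assumptions made for the example.

```python
import numpy as np

def window_mean(scores, window):
    """Mean of the last `window` scores at each timestep (shorter at the start).
    Assumption: the mean is used as the aggregate; the paper may use another."""
    scores = np.asarray(scores, dtype=float)
    out = np.empty_like(scores)
    for t in range(scores.size):
        out[t] = scores[max(0, t - window + 1): t + 1].mean()
    return out

def first_alarm(ood_scores, unc_scores, gamma_o, gamma_a, window):
    """Return the first timestep at which BOTH windowed scores exceed their
    thresholds (logical AND), or None if no alarm is raised."""
    ood = window_mean(ood_scores, window)
    unc = window_mean(unc_scores, window)
    hits = np.flatnonzero((ood > gamma_o) & (unc > gamma_a))
    return int(hits[0]) if hits.size else None

# Hypothetical score traces: the OOD score rises first, the uncertainty later.
ood = [0.1, 0.2, 0.9, 0.9, 0.9, 0.9]
unc = [0.1, 0.1, 0.1, 0.8, 0.8, 0.8]
alarm_t = first_alarm(ood, unc, gamma_o=0.5, gamma_a=0.5, window=2)
```

In this toy trace the OOD indicator alone crosses its threshold at timestep 2, but the alarm is only raised once the uncertainty indicator also stays high, which is the behavior the AND fusion is meant to produce for benign OOD situations.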

Abstract

Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy's embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at https://tum-lsy.github.io/fiper_website.
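
To illustrate the calibration step, the sketch below computes a constant threshold from successful rollouts using the standard split conformal quantile. It is a simplified stand-in under stated assumptions: the helper name `conformal_threshold` and the calibration values are hypothetical, each rollout is summarized by a single scalar score (for example, its maximum windowed score), and the paper's own thresholds may be time-varying and may split the error budget between the two scores differently.

```python
import math
import numpy as np

def conformal_threshold(calib_scores, delta):
    """Split conformal threshold from n calibration scores.
    Returns the ceil((n + 1) * (1 - delta))-th smallest score, or infinity
    when n is too small to support the requested level."""
    scores = np.sort(np.asarray(calib_scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1.0 - delta))
    if k > n:
        return math.inf  # not enough calibration rollouts for this delta
    return float(scores[k - 1])

# Hypothetical per-rollout scores from 10 successful calibration rollouts.
calib = [0.31, 0.12, 0.45, 0.27, 0.90, 0.18, 0.52, 0.66, 0.39, 0.74]
gamma = conformal_threshold(calib, delta=0.2)
gamma_strict = conformal_threshold(calib, delta=0.05)
```

If a new successful rollout is exchangeable with the calibration rollouts, its score exceeds `gamma` with probability at most `delta`. With only 10 rollouts, a level of `delta=0.05` cannot be certified, so the sketch returns an infinite threshold rather than an overconfident one.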

Paper Structure

This paper contains 62 sections, 6 theorems, 33 equations, 12 figures, 6 tables.

Key Result

Proposition 1

Set $\delta \in (0,1)$, and define the thresholds $\gamma_{O,t}$ and $\gamma_{A,t}$ as described above. Then, the probability that the failure predictor of FIPER (eq:fiper_decision_fct) flags a new successful ID rollout ${\bm{\tau}} \sim q_\pi$ of length $T' \leq T$ as Fail at any policy timestep $t$ ...

Figures (12)

  • Figure 1: FIPER can predict task failures of generative robot policies during runtime without using any failure data. FIPER detects two key signals indicative of impending failure: (i) consecutive out-of-distribution (OOD) observations via random network distillation in the policy’s observation embedding space (RND-OE) and (ii) persistently high uncertainty in generated actions via action-chunk entropy (ACE). Both scores are calibrated based on a few successful rollouts and aggregated over a sliding window. If both submodules issue a warning, then FIPER predicts a failure. Its task-agnostic design enables FIPER to issue accurate and early failure warnings in diverse environments.
  • Figure 2: RND-OE recognizes failure-prone out-of-distribution (OOD) situations using random network distillation (RND) in the policy's observation embedding space.
  • Figure 3: (a) to (c) In imitation learning from multimodal demonstrations, uncertainty in generated actions is reflected in entropy rather than variance. (d) Our action-chunk entropy (ACE) score is designed to handle observation-dependent action multimodality, whereas STAC (Agia et al., 2024) typically associates timesteps at which the policy decides on a behavior mode with high uncertainty.
  • Figure 4: Our proposed scores, RND-OE and ACE, can distinguish failures from benign out-of-distribution (OOD) situations that the policy can generalize to. We group the rollouts into four categories along two axes: Success vs. Fail, and in-distribution (ID) vs. OOD. Robust failure prediction requires distinguishing Success OOD from Fail ID. We report the mean values across all tasks and five seeds, with the bars indicating the 25/75% quantiles.
  • Figure 5: We compare our approach of aggregating uncertainty over a sliding window to accumulating scores over all past timesteps (Agia et al., 2024). The latter method detects failure rollouts very late, mostly due to their greater length, whereas our approach can predict failures earlier. We use the same CP constant threshold definition for all methods, following STAC (Agia et al., 2024). We normalize DT by the maximum episode length and average the results across five seeds, with the bars indicating the standard deviation.
  • ...and 7 more figures
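
The idea behind the RND-based score in Figure 2 can be illustrated with a toy example. This is a sketch under strong simplifying assumptions and not the paper's RND-OE: random vectors stand in for policy observation embeddings, the fixed random target is a single tanh layer, and the predictor is a linear least-squares fit instead of a trained neural network. All names (`target`, `with_bias`, `rnd_score`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, out_dim = 8, 16

# Fixed, randomly initialized target network (never trained).
W = rng.normal(size=(dim, out_dim))
b = rng.normal(size=out_dim)

def target(x):
    return np.tanh(x @ W + b)

def with_bias(x):
    """Append a constant column so the linear predictor has an offset."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

# Stand-in for embeddings of in-distribution (successful) observations.
train = rng.normal(scale=0.1, size=(500, dim))

# Predictor: fit to imitate the target on in-distribution data only.
# Assumption: least squares replaces gradient-based distillation here.
coef, *_ = np.linalg.lstsq(with_bias(train), target(train), rcond=None)

def rnd_score(x):
    """Mean squared predictor-target error per sample; high means novel."""
    err = with_bias(x) @ coef - target(x)
    return (err ** 2).mean(axis=1)

id_scores = rnd_score(rng.normal(scale=0.1, size=(100, dim)))
ood_scores = rnd_score(rng.normal(loc=3.0, scale=0.1, size=(100, dim)))
```

Because the predictor only matches the target where it has seen data, inputs far from the training region yield a larger error, which is what makes the distillation error usable as an OOD score.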

Theorems & Definitions (11)

  • Proposition 1
  • Remark 1: Rollout data
  • Theorem 1: Adapted from Angelopoulos and Bates (2023), Theorem D.1
  • Proposition 2: Bounded FPR with a CP constant threshold
  • proof
  • Theorem 2: Adapted from Diquigiovanni et al. (2024), Appendix A.3
  • Proposition 3: Bounded FPR with a one-sided CP band
  • proof
  • Remark 2
  • Proposition 4: Extension of the FIPER bound (prop:fiper_bound)
  • ...and 1 more