FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models

Zijun Lin, Jiafei Duan, Haoquan Fang, Dieter Fox, Ranjay Krishna, Cheston Tan, Bihan Wen

TL;DR

FailSafe addresses a gap in robotic Vision-Language-Action (VLA) systems, the lack of failure data paired with executable recoveries, by automatically generating diverse failure scenarios, pairing each with a directly executable recovery action, and validating the pairs through systematic verification. Fine-tuning a large VLM on FailSafe data yields FailSafe-VLM, which provides real-time failure detection and corrective action guidance, improves the performance of state-of-the-art VLA models by up to $22.6\%$ on ManiSkill tasks, and generalizes across viewpoints, objects, and embodiments. The method introduces three failure modes ($x,y,z$ translation, rotation, and no-ops), a recovery-action collection pipeline yielding $7$-DoF corrections, and a dataset with multi-view observations to support robust learning. Overall, FailSafe demonstrates a scalable path toward more autonomous, robust, and explainable embodied AI in manipulation tasks, and the authors plan to release the code for community use.

Abstract

Recent advances in robotic manipulation have integrated low-level robotic control into Vision-Language Models (VLMs), extending them into Vision-Language-Action (VLA) models. Although state-of-the-art VLAs achieve strong performance in downstream robotic applications, supported by large-scale crowd-sourced robot training data, they still inevitably encounter failures during execution. Enabling robots to reason about and recover from unpredictable and abrupt failures remains a critical challenge. Existing robotic manipulation datasets, collected in either simulation or the real world, primarily provide only ground-truth trajectories, leaving robots unable to recover once failures occur. Moreover, the few datasets that address failure detection typically offer only textual explanations, which are difficult to use directly in VLA models. To address this gap, we introduce FailSafe, a novel failure generation and recovery system that automatically produces diverse failure cases paired with executable recovery actions. FailSafe can be seamlessly applied to any manipulation task in any simulator, enabling scalable creation of failure-action data. To demonstrate its effectiveness, we fine-tune LLaVa-OneVision-7B (LLaVa-OV-7B) to build FailSafe-VLM. Experimental results show that FailSafe-VLM successfully helps robotic arms detect and recover from potential failures, improving the performance of three state-of-the-art VLA models (pi0-FAST, OpenVLA, OpenVLA-OFT) by up to 22.6% on average across several tasks in ManiSkill. Furthermore, FailSafe-VLM generalizes across different spatial configurations, camera viewpoints, objects, and robotic embodiments. We plan to release the FailSafe code to the community.

Paper Structure

This paper contains 16 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An illustration of the FailSafe pipeline generating failure scenarios and corresponding executable recovery actions (above). Leveraging these, FailSafe enables FailSafe-VLM (below) to detect and recover from robot failures, while generalizing across different spatial configurations, viewing angles, objects, and embodiments.
  • Figure 2: Top: Overall pipeline of FailSafe, which includes the autonomous generation of failure trajectories (I) and the collection of delta recovery actions (II). Failure-action data pairs are passed to the next step only after a systematic verification (III) confirms that the recovery action is effective. Bottom: The FailSafe dataset (IV) is then used to fine-tune FailSafe-VLM, which is able to help robotic arms recover from failure cases (V).
  • Figure 3: Illustration of how FailSafe-VLM collaborates with VLA models to perform failure reasoning and recovery. To simulate real-world settings, VLA models and FailSafe-VLM share the same camera view, which is used during VLA training but novel to FailSafe-VLM.
  • Figure 4: Examples of how FailSafe-VLM helps VLA models recover from failure scenarios, showing the x- and z-axis trajectories of the end effector over time (zoomed in for a clearer view).
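
The generate, collect, and verify stages described for Figure 2, together with the three failure modes named in the TL;DR, can be illustrated with a small sketch. This is not the authors' implementation: the 7-DoF layout `[dx, dy, dz, droll, dpitch, dyaw, gripper]`, the function names, the perturbation magnitude, and the tolerance-based check are all assumptions made for illustration, and the actual verification presumably runs inside a simulator rather than comparing vectors.

```python
import random
from typing import List, Sequence

# Assumed 7-DoF layout (not specified in the text above):
# [dx, dy, dz, droll, dpitch, dyaw, gripper]
FAILURE_MODES = ("translation", "rotation", "no_op")


def inject_failure(action: Sequence[float], mode: str, rng: random.Random,
                   magnitude: float = 0.05) -> List[float]:
    """Return a perturbed copy of a nominal action (hypothetical helper)."""
    if len(action) != 7:
        raise ValueError("expected a 7-DoF action")
    failed = list(action)
    if mode == "translation":
        axis = rng.randrange(0, 3)            # one of x, y, z
        failed[axis] += rng.choice((-1.0, 1.0)) * magnitude
    elif mode == "rotation":
        axis = rng.randrange(3, 6)            # one of roll, pitch, yaw
        failed[axis] += rng.choice((-1.0, 1.0)) * magnitude
    elif mode == "no_op":
        failed[:6] = [0.0] * 6                # arm does not move; gripper kept
    else:
        raise ValueError(f"unknown failure mode: {mode}")
    return failed


def recovery_delta(failed: Sequence[float],
                   nominal: Sequence[float]) -> List[float]:
    """Delta that maps the failed command back onto the nominal one."""
    return [n - f for n, f in zip(nominal, failed)]


def verify_recovery(failed: Sequence[float], delta: Sequence[float],
                    nominal: Sequence[float], tol: float = 1e-6) -> bool:
    """Stand-in check: keep a pair only if the delta restores the nominal."""
    return all(abs(f + d - n) <= tol
               for f, d, n in zip(failed, delta, nominal))


if __name__ == "__main__":
    rng = random.Random(0)
    nominal = [0.01, 0.0, 0.02, 0.0, 0.0, 0.0, 1.0]
    for mode in FAILURE_MODES:
        failed = inject_failure(nominal, mode, rng)
        delta = recovery_delta(failed, nominal)
        print(mode, verify_recovery(failed, delta, nominal))
```

In this toy version the verification always succeeds by construction. The point is only the data flow: a failure is injected, a delta correction is computed, and the pair is kept only if a check passes.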