FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models

Zijun Lin; Jiafei Duan; Haoquan Fang; Dieter Fox; Ranjay Krishna; Cheston Tan; Bihan Wen

TL;DR

FailSafe addresses a gap in robotic Vision-Language-Action (VLA) systems: it automatically generates diverse failure scenarios, pairs each with a directly executable recovery action, and validates them through systematic verification. Fine-tuning a large VLM on FailSafe data yields FailSafe-VLM, which provides real-time failure detection and corrective action guidance, improves state-of-the-art VLA models by up to $22.6\%$ on ManiSkill tasks, and generalizes across viewpoints, objects, and embodiments. The method covers three failure modes (translation along $x$, $y$, or $z$; rotation; and no-ops), uses a recovery-action collection pipeline that yields $7$-DoF corrections, and builds a dataset with multi-view observations. Overall, FailSafe offers a scalable path toward more autonomous, robust, and explainable embodied AI for manipulation tasks, and the authors plan to release the code for community use.
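To make the three failure modes and the $7$-DoF correction concrete, here is a minimal Python sketch. It is not the FailSafe pipeline: the action layout `[dx, dy, dz, droll, dpitch, dyaw, gripper]`, the function names `inject_failure` and `recovery_action`, the perturbation magnitudes, and the additive treatment of small rotations are all assumptions made for illustration.

```python
import random

# Hypothetical action layout (an assumption, not taken from the paper):
# [dx, dy, dz, droll, dpitch, dyaw, gripper]
FAILURE_MODES = ("translation", "rotation", "no_op")


def inject_failure(action, mode, rng, max_offset=0.05, max_angle=0.3):
    """Return a perturbed copy of a 7-DoF action for one failure mode.

    max_offset (metres) and max_angle (radians) are illustrative values.
    """
    if len(action) != 7:
        raise ValueError("expected a 7-DoF action")
    failed = list(action)
    sign = rng.choice((-1.0, 1.0))
    if mode == "translation":
        # Offset one of the x, y, z components.
        axis = rng.randrange(3)
        failed[axis] += sign * rng.uniform(0.5 * max_offset, max_offset)
    elif mode == "rotation":
        # Offset one of the roll, pitch, yaw components.
        axis = 3 + rng.randrange(3)
        failed[axis] += sign * rng.uniform(0.5 * max_angle, max_angle)
    elif mode == "no_op":
        # The arm does not move; the gripper command is left unchanged.
        failed = [0.0] * 6 + [action[6]]
    else:
        raise ValueError(f"unknown failure mode: {mode}")
    return failed


def recovery_action(intended, executed):
    """7-DoF correction that undoes the gap between intended and executed.

    Assumes small deltas so that rotations can be subtracted component-wise.
    """
    delta = [i - e for i, e in zip(intended[:6], executed[:6])]
    return delta + [intended[6]]
```

Under these assumptions, applying `recovery_action(intended, failed)` after the failed step restores the intended pose delta; a real system would compose rotations properly and verify the recovery in simulation, as the verification step described above suggests.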

Abstract

Recent advances in robotic manipulation have integrated low-level robotic control into Vision-Language Models (VLMs), extending them into Vision-Language-Action (VLA) models. Although state-of-the-art VLAs achieve strong performance in downstream robotic applications, supported by large-scale crowd-sourced robot training data, they still inevitably encounter failures during execution. Enabling robots to reason about and recover from unpredictable and abrupt failures remains a critical challenge. Existing robotic manipulation datasets, collected in either simulation or the real world, primarily provide only ground-truth trajectories, leaving robots unable to recover once failures occur. Moreover, the few datasets that address failure detection typically offer only textual explanations, which are difficult to use directly in VLA models. To address this gap, we introduce FailSafe, a novel failure generation and recovery system that automatically produces diverse failure cases paired with executable recovery actions. FailSafe can be seamlessly applied to any manipulation task in any simulator, enabling scalable creation of failure action data. To demonstrate its effectiveness, we fine-tune LLaVA-OneVision-7B (LLaVA-OV-7B) to build FailSafe-VLM. Experimental results show that FailSafe-VLM successfully helps robotic arms detect and recover from potential failures, improving the performance of three state-of-the-art VLA models (pi0-FAST, OpenVLA, OpenVLA-OFT) by up to 22.6% on average across several tasks in ManiSkill. Furthermore, FailSafe-VLM generalizes across different spatial configurations, camera viewpoints, objects, and robotic embodiments. We plan to release the FailSafe code to the community.
