FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction

Yifan Yang, Zhixiang Duan, Tianshi Xie, Fuyu Cao, Pinxi Shen, Peili Song, Piaopiao Jin, Guokang Sun, Shaoqing Xu, Yangwei You, Jingtai Liu

TL;DR

FPC-VLA addresses the fragility of open-ended robotic manipulation by coupling a Vision-Language-Action policy with a VLM-based supervisor that predicts and corrects potential failures. A dual-stream action fusion module stabilizes action sequences by incorporating history and decoupling pose from gripper state, while an automated RLDS-derived dataset enables scalable supervision. Across SIMPLER, LIBERO, and real-world robots, FPC-VLA achieves state-of-the-art zero-shot and fine-tuned performance, and demonstrates notable cross-platform generalization and real-world reliability. The framework advances practical autonomy by enabling proactive failure handling and smoother, more reliable manipulation in unstructured environments.

Abstract

Robotic manipulation is a fundamental component of automation. However, traditional perception-planning pipelines often fall short in open-ended tasks due to limited flexibility, while a single end-to-end Vision-Language-Action (VLA) model offers promising capabilities but lacks crucial mechanisms for anticipating and recovering from failure. To address these challenges, we propose FPC-VLA, a dual-model framework that integrates a VLA with a supervisor for failure prediction and correction. The supervisor, which is trained efficiently without manual labeling, evaluates action viability through vision-language queries and generates corrective strategies when risks arise. A dual-stream fusion module further refines actions by leveraging past predictions. Evaluation results on multiple simulation platforms (SIMPLER and LIBERO) and robot embodiments (WidowX, Google Robot, Franka) show that FPC-VLA outperforms state-of-the-art models in both zero-shot and fine-tuned settings. Successful real-world deployments on diverse, long-horizon tasks confirm FPC-VLA's strong generalization and practical utility for building more reliable autonomous systems.
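
The interaction between the policy and the supervisor described in the abstract can be pictured with a small control-loop sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `Verdict` and `supervised_step`, the retry limit, and the choice to feed the correction back by appending it to the instruction are all hypothetical, and the summary above does not specify how corrections are actually applied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Verdict:
    """Hypothetical supervisor output: is the action viable, and if not, a hint."""
    viable: bool
    correction: Optional[str] = None


def supervised_step(
    policy: Callable[[Any, str], Any],
    supervisor: Callable[[Any, str, Any], Verdict],
    image: Any,
    instruction: str,
    max_retries: int = 2,
) -> Any:
    """Propose an action, let the supervisor check it, and re-plan on predicted failure.

    Assumption: the correction is fed back by appending it to the instruction.
    """
    action = policy(image, instruction)
    for _ in range(max_retries):
        verdict = supervisor(image, instruction, action)
        if verdict.viable:
            return action
        # Predicted failure: ask the policy again with the corrective hint attached.
        action = policy(image, f"{instruction}. {verdict.correction}")
    # Retry budget exhausted: return the latest proposal without a further check.
    return action
```

With stub callables standing in for the VLA and the VLM, a rejected first proposal is replaced by the action the policy produces once the hint is attached. Bounding the number of retries keeps the per-step latency predictable, which matters if the supervisor is a large model.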

Paper Structure

This paper contains 21 sections, 25 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: We propose FPC-VLA, a dual-model framework that integrates VLA with a supervisor for failure prediction and correction. It outperforms other advanced methods in all benchmarks.
  • Figure 2: Architecture of FPC-VLA. The framework takes as input an observed image and a natural language instruction. A Vision-Language-Action (VLA) model first predicts a sequence of actions, which is then refined by a Dual-Stream Action Fusion Module that integrates historical predictions with the current prediction to generate the end-effector pose increment and the gripper state of the robotic arm. At keyframes where the gripper state changes, a VLM-based Supervisor detects potential failures and provides corrective guidance to optimize the action execution.
  • Figure 3: Architecture of Vision-Language-Action Model. The input image passes through two visual feature encoders and an MLP projector, is concatenated with tokenized language features, and then fed into Llama2 to obtain cognitive features. These features serve as conditioning inputs to a Diffusion Transformer, which progressively denoises noise to generate a predicted action sequence conditioned on the current observation.
  • Figure 4: Simulation results of FPC-VLA on different robots. The model performs well even in long-horizon tasks and non-pick-and-place tasks (e.g., opening drawers and turning on the stove).
  • Figure 5: Real-world experimental results of FPC-VLA on the Xiaomi Robot and ALOHA. The Supervisor's failure correction process is illustrated using the task "Stack orange block on green block" as an example.
  • ...and 3 more figures
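
The Figure 2 caption describes a fusion module that combines historical predictions with the current one, treats the pose increment and the gripper state separately, and triggers the supervisor at keyframes where the gripper state changes. The sketch below shows one way such a scheme could look. It is an assumption-laden illustration rather than the paper's method: the exponentially decayed weighting, the thresholded gripper vote, the 7-value action layout (6 pose values plus 1 gripper value), and the names `fuse_actions` and `is_keyframe` are all hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_actions(
    history: Sequence[Sequence[float]],
    current: Sequence[float],
    decay: float = 0.5,
    grip_threshold: float = 0.5,
) -> Tuple[List[float], int]:
    """Fuse earlier predictions for this timestep with the current prediction.

    Each action is assumed to be 7 values: a 6-D pose increment followed by a
    gripper value in [0, 1]. `history` is ordered oldest first.
    """
    preds = list(history) + [current]
    k = len(preds)
    # Assumed weighting: the newest prediction gets weight 1, older ones decay.
    raw = [decay ** (k - 1 - i) for i in range(k)]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Pose stream: weighted average of the continuous pose increments.
    pose = [sum(w * p[d] for w, p in zip(weights, preds)) for d in range(6)]

    # Gripper stream: weighted vote, thresholded to a discrete open/close state.
    grip_score = sum(w * p[6] for w, p in zip(weights, preds))
    gripper = 1 if grip_score >= grip_threshold else 0
    return pose, gripper


def is_keyframe(prev_gripper: int, gripper: int) -> bool:
    """A keyframe is a step at which the discrete gripper state changes."""
    return prev_gripper != gripper
```

Averaging the pose smooths jitter between consecutive predictions, while thresholding the gripper separately avoids half-open states that a plain average would produce. Restricting supervisor calls to keyframes limits the number of VLM queries per episode.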