FPC-VLA: A Vision-Language-Action Framework with a Supervisor for Failure Prediction and Correction

Yifan Yang; Zhixiang Duan; Tianshi Xie; Fuyu Cao; Pinxi Shen; Peili Song; Piaopiao Jin; Guokang Sun; Shaoqing Xu; Yangwei You; Jingtai Liu

TL;DR

FPC-VLA addresses the fragility of open-ended robotic manipulation by coupling a Vision-Language-Action policy with a VLM-based supervisor that predicts and corrects potential failures. A dual-stream action fusion module stabilizes action sequences by incorporating history and decoupling pose from gripper state, while an automated RLDS-derived dataset enables scalable supervision. Across SIMPLER, LIBERO, and real-world robots, FPC-VLA achieves state-of-the-art zero-shot and fine-tuned performance, and demonstrates notable cross-platform generalization and real-world reliability. The framework advances practical autonomy by enabling proactive failure handling and smoother, more reliable manipulation in unstructured environments.

Abstract

Robotic manipulation is a fundamental component of automation. However, traditional perception-planning pipelines often fall short in open-ended tasks due to limited flexibility, while single end-to-end Vision-Language-Action (VLA) models offer promising capabilities but lack crucial mechanisms for anticipating and recovering from failure. To address these challenges, we propose FPC-VLA, a dual-model framework that integrates a VLA with a supervisor for failure prediction and correction. The supervisor evaluates action viability through vision-language queries and generates corrective strategies when risks arise; it is trained efficiently without manual labeling. A dual-stream fusion module further refines actions by leveraging past predictions. Evaluations on multiple simulation platforms (SIMPLER and LIBERO) and robot embodiments (WidowX, Google Robot, Franka) show that FPC-VLA outperforms state-of-the-art models in both zero-shot and fine-tuned settings. Successful real-world deployments on diverse, long-horizon tasks confirm FPC-VLA's strong generalization and practical utility for building more reliable autonomous systems.
