AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Jiafei Duan; Wilbert Pumacay; Nishanth Kumar; Yi Ru Wang; Shulin Tian; Wentao Yuan; Ranjay Krishna; Dieter Fox; Ajay Mandlekar; Yijie Guo

TL;DR

This work presents Aha, an open-source VLM for detecting and reasoning about failures in robotic manipulation by treating failure recognition as free-form language reasoning. It introduces FailGen, a scalable data-generation pipeline that creates 49k failure demonstrations across 79 RLBench tasks to instruction-tune Aha (Aha-13B), achieving strong cross-domain generalization to real-world failures and unseen tasks. The model demonstrates superior failure reasoning across multiple datasets and enhances downstream robotic systems by providing natural-language failure feedback to improve rewards, planning, and verification, outperforming GPT-4o and several VLM baselines. Overall, Aha offers a practical path to richer failure understanding in robotics, with open-source tooling and demonstrated impact on real manipulation pipelines.

Abstract

Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they still struggle with failure recognition, limiting their real-world applicability. We introduce AHA, an open-source VLM designed to detect and reason about failures in robotic manipulation using natural language. By framing failure detection as a free-form reasoning task, AHA identifies failures and provides detailed, adaptable explanations across different robots, tasks, and environments. We fine-tuned AHA using FailGen, a scalable framework that generates the first large-scale dataset of robotic failure trajectories, the AHA dataset. FailGen achieves this by procedurally perturbing successful demonstrations from simulation. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real-world failure datasets, robotic systems, and unseen tasks. It surpasses the second-best model (GPT-4o in-context learning) by 10.3% and exceeds the average performance of six compared models, including five state-of-the-art VLMs, by 35.3% across multiple metrics and datasets. We integrate AHA into three manipulation frameworks that utilize LLMs/VLMs for reinforcement learning, task and motion planning, and zero-shot trajectory generation. AHA's failure feedback enhances these policies' performance by refining dense reward functions, optimizing task planning, and improving sub-task verification, boosting task success rates by an average of 21.4% across all three tasks compared to GPT-4 models.

Paper Structure (15 sections, 4 figures, 3 tables)

Introduction
Related Work
The Aha Dataset
Failure Modes in Robotic Manipulation
Implementation of the Aha dataset
Method
Failure Reasoning Formulation
Synthetic Data for Instruction-tuning
Instruction Fine-tuning
Experimental Results
Experimental Setup
Quantitative Experimental Results
Downstream Robotics Tasks
Conclusion
Acknowledgement

Figures (4)

Figure 1: Aha is a Vision-Language Model designed to detect and reason about failures in robotic manipulation. As an instruction-tuned VLM, it can enhance task performance in robotic applications that utilize VLMs for reward generation, task planning, or sub-task verification. By incorporating Aha into the reasoning pipeline, these applications can achieve accelerated and improved performance.
Figure 2: Overview of Aha Pipeline. (Top) The data generation for Aha is accomplished by taking a normal task trajectory in simulation and procedurally perturbing all keyframes using our taxonomy of failure modes. Through FailGen, we systematically alter keyframes to synthesize failure demonstrations conditioned on the original tasks. Simultaneously, we generate corresponding query and answer prompts for each task and failure mode, which are used for instruction-tuning. (Bottom) The instruction-tuning pipeline follows the same fine-tuning procedure as LLaVA-v1.5 (Liu et al., 2023), where we fine-tune only the LLM base model (in this case, LLaMA-2-13B) and the projection linear layers, while freezing the image encoder and tokenizer.
Figure 3: (Left) Scaling law with the Aha dataset. Effect of varying the amount of domain-specific fine-tuning data on model performance. (Right) Downstream Robotic Application Performance. Aha-13B outperforms GPT-4o in reasoning about failures within these robotic applications, leading to improved performance on the downstream tasks.
Figure 4: Downstream Robotic Application. We demonstrated that Aha can be integrated into existing LLM/VLM-assisted robotic applications to provide failure reasoning and feedback, helping to accelerate and improve task success rates in these systems.
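
The keyframe perturbation described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the FailGen implementation: the `Keyframe` fields, the three failure modes, the perturbation magnitudes, and the query/answer wording are all hypothetical stand-ins, chosen only to show how one successful trajectory can be expanded into several failure demonstrations, each paired with a language prompt.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Keyframe:
    position: tuple       # (x, y, z) end-effector position in metres
    yaw_deg: float        # gripper rotation about the vertical axis
    gripper_closed: bool  # True if the gripper should be grasping here


def shift_position(kf, rng):
    # Hypothetical magnitude: move the target 5-10 cm off in x and y.
    x, y, z = kf.position
    kf.position = (x + rng.uniform(0.05, 0.10), y + rng.uniform(0.05, 0.10), z)
    return "the gripper moved to a position offset from the target"


def shift_rotation(kf, rng):
    # Hypothetical magnitude: rotate 30-90 degrees in either direction.
    kf.yaw_deg += rng.choice([-1, 1]) * rng.uniform(30.0, 90.0)
    return "the gripper was rotated to the wrong angle"


def drop_grasp(kf, rng):
    kf.gripper_closed = False
    return "the gripper did not close on the object"


# name -> (applies_to_keyframe, perturbation). Illustrative modes only,
# not the paper's taxonomy.
FAILURE_MODES = {
    "position_offset": (lambda kf: True, shift_position),
    "rotation_offset": (lambda kf: True, shift_rotation),
    "no_grasp": (lambda kf: kf.gripper_closed, drop_grasp),
}


def generate_failures(trajectory, task, rng):
    """Yield one (failed_trajectory, qa_pair) per applicable keyframe and mode."""
    for idx, keyframe in enumerate(trajectory):
        for mode, (applies, perturb) in FAILURE_MODES.items():
            if not applies(keyframe):
                continue
            failed = copy.deepcopy(trajectory)  # leave the success demo intact
            reason = perturb(failed[idx], rng)
            qa = {
                "query": f"Task: {task}. Was sub-task {idx} completed? If not, explain why.",
                "answer": f"No, {reason}.",
                "mode": mode,
                "keyframe": idx,
            }
            yield failed, qa


demo = [
    Keyframe((0.40, 0.00, 0.30), 0.0, False),  # approach
    Keyframe((0.40, 0.00, 0.05), 0.0, True),   # grasp
    Keyframe((0.10, 0.30, 0.20), 90.0, True),  # transport
]
failures = list(generate_failures(demo, "put the block in the drawer", random.Random(0)))
print(len(failures), failures[0][1]["answer"])
```

Each yielded trajectory differs from the successful one at exactly one keyframe, so the paired answer can name a single failure mode and the sub-task where it occurs. In a real pipeline the perturbed keyframes would be replayed in the simulator and rendered to images before being used for instruction-tuning.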
