I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

Clemence Grislain, Hamed Rahimi, Olivier Sigaud, Mohamed Chetouani

TL;DR

This work proposes a method for building datasets that target semantic misalignment failure detection from existing language-conditioned manipulation datasets, and presents I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection.

Abstract

Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets that target semantic misalignment failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post-training a base VLM, followed by training lightweight classification heads, called FS blocks, which are attached to different internal layers of the VLM and whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and transfers effectively to other simulation environments and to the real world, either zero-shot or with minimal post-training. The datasets and models are publicly released on HuggingFace (Webpage: https://clemgris.github.io/I-FailSense/).
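
The abstract does not spell out how the semantic misalignment dataset is built, so the following is only a minimal sketch of one plausible construction, assuming that negatives are obtained by pairing a successful trajectory with an instruction taken from a different task. The `Episode` and `LabeledExample` types, the `build_smf_examples` helper, and the task names and instructions are all hypothetical and are not taken from the released code or datasets.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """A successful episode from a language-conditioned dataset (hypothetical schema)."""
    trajectory_id: str
    task: str
    instruction: str


@dataclass(frozen=True)
class LabeledExample:
    trajectory_id: str
    instruction: str
    label: int  # 1 = success, 0 = semantic misalignment failure


def build_smf_examples(episodes, seed=0):
    """Pair each trajectory with its own instruction (positive) and with an
    instruction from a different task (negative). This pairing rule is an
    assumption made for illustration."""
    rng = random.Random(seed)
    examples = []
    for ep in episodes:
        # Positive: the trajectory matches the instruction it was collected with.
        examples.append(LabeledExample(ep.trajectory_id, ep.instruction, 1))
        # Negative: same trajectory, instruction drawn from another task.
        candidates = [other.instruction for other in episodes if other.task != ep.task]
        if candidates:
            examples.append(LabeledExample(ep.trajectory_id, rng.choice(candidates), 0))
    return examples


# Hypothetical episodes, loosely inspired by the rotate-left/rotate-right example.
episodes = [
    Episode("traj_0", "rotate_pink_cube_left", "rotate the pink cube to the left"),
    Episode("traj_1", "rotate_pink_cube_right", "rotate the pink cube to the right"),
    Episode("traj_2", "open_drawer", "open the drawer"),
]
dataset = build_smf_examples(episodes)
for example in dataset:
    print(example)
```

In this sketch the trajectory in a negative example is itself a valid execution of some task, which is what makes the failure semantic rather than a control error: only the pairing with the instruction is wrong.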

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of I-FailSense, which classifies a robot’s observation trajectories conditioned on language instructions into failure or success. Trained on semantic misalignment failure detection, I-FailSense excels at identifying these challenging errors, zero-shot generalizes to detecting control errors and errors in new simulation environments, and detects errors in real-world observations with minimal post-training.
  • Figure 2: I-FailSense Architecture. (A) The model takes as input an observation trajectory aggregated into a single image $\tau \in \mathbb{R}^{3\times (H \cdot N)\times (W \cdot T)}$ (here, $N=1$ PoV and $T=4$ timesteps) and a semantic goal $g$, and outputs a binary success/failure prediction $\hat{y}$. I-FailSense is built on a base VLM (PaliGemma2-mix; Steiner et al., 2024) and is post-trained in two stages: (1) the projection MLP is fine-tuned along with LoRA adapters applied to the language modules of the VLM's LLM base model, and (2) the VLM is frozen while the FS blocks, attached to the adapted language modules, are fine-tuned for binary classification. The FS block outputs are aggregated with the VLM’s final output through a voting mechanism to produce the final prediction. (B) The FS block consists of a hybrid attention pooling module, composed of multi-head attention (MHA) and an MLP, followed by residual MLP blocks with batch normalization and a final binary classification MLP.
  • Figure 3: Example data in $\mathcal{D}_{\text{SMF-CALVIN}}$. Top: a positive example where the observation trajectory correctly matches the paired instruction. Bottom: a negative example illustrating semantic misalignment, where the robot rotates the correct object (the pink cube) to the right instead of to the left as instructed.
  • Figure 4: Example data in $\mathcal{D}_{\text{AHA}}$. Two negative examples from the AHA dataset (exocentric PoV) demonstrating control failures. Top: the knife slips through the robot's gripper. Bottom: the robot fails to grasp the computer lid.
  • Figure 5: Example data in $\mathcal{D}_{\text{SMF-DROID}}$. Two examples from the semantic misalignment failure dataset built on DROID (exocentric PoV). Top: a positive example where the instruction matches the observation trajectory. Bottom: a negative example where the instruction and trajectory mismatch.
  • ...and 1 more figures
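
The input format and the aggregation step described in the Figure 2 caption can be illustrated with a small NumPy sketch. It assumes that the $N$ points of view are stacked vertically and the $T$ timesteps horizontally, which is consistent with the shape $3\times (H \cdot N)\times (W \cdot T)$. The caption only mentions a "voting mechanism", so the plain majority vote below, with ties resolved as failure, is an assumption, and the function names `tile_trajectory` and `majority_vote` are hypothetical.

```python
import numpy as np


def tile_trajectory(frames):
    """Aggregate a trajectory into a single image.

    frames: array of shape (N, T, 3, H, W), with N points of view and T timesteps.
    Returns an array of shape (3, H*N, W*T): views stacked along the height,
    timesteps stacked along the width (assumed layout).
    """
    n, t, c, h, w = frames.shape
    # (N, T, C, H, W) -> (C, N, H, T, W) -> (C, N*H, T*W)
    return frames.transpose(2, 0, 3, 1, 4).reshape(c, n * h, t * w)


def majority_vote(votes):
    """Combine binary predictions (1 = success, 0 = failure) by majority.

    Ties are resolved as failure; this tie-breaking rule is an assumption.
    """
    votes = np.asarray(votes, dtype=int)
    return int(votes.sum() * 2 > votes.size)


# One point of view, four timesteps, as in the Figure 2 caption.
demo_frames = np.zeros((1, 4, 3, 224, 224), dtype=np.uint8)
tau = tile_trajectory(demo_frames)
print(tau.shape)  # (3, 224, 896)

# Hypothetical binary outputs of three FS blocks plus the VLM's final output.
fs_block_votes = [1, 0, 1]
vlm_vote = 1
y_hat = majority_vote(fs_block_votes + [vlm_vote])
print(y_hat)  # 1
```

Tiling the frames into one image lets a single-image VLM see the whole trajectory at once, at the cost of a lower effective resolution per frame if the tiled image is resized to the model's input size.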