A Unified Framework for Real-Time Failure Handling in Robotics Using Vision-Language Models, Reactive Planner and Behavior Trees

Faseeh Ahmad, Hashim Ismail, Jonathan Styrud, Maj Stenmark, Volker Krueger

TL;DR

This work tackles real-time failure handling in dynamic robotics by proposing a unified framework that fuses Vision-Language Models (VLMs), a reactive planner, and Behavior Trees (BTs) to perform pre-execution verification and continuous execution monitoring. The approach introduces a Verifier and Suggestor that, along with a continuously updated scene graph and execution history, enable context-aware detection, identification, and correction of failures during task execution. Experimental validation on AI2-THOR simulations and a real ABB YuMi platform demonstrates improved task success and adaptability over pre-execution or reactive methods alone, with ablations underscoring the value of VLM reasoning, structured scene understanding, and execution-history tracking. The results suggest significant potential for robust, autonomous failure recovery in real-world robotic applications, reducing downtime and enabling safer human-robot collaboration.

Abstract

Robotic systems often face execution failures due to unexpected obstacles, sensor errors, or environmental changes. Traditional failure recovery methods rely on predefined strategies or human intervention, making them less adaptable. This paper presents a unified failure recovery framework that combines Vision-Language Models (VLMs), a reactive planner, and Behavior Trees (BTs) to enable real-time failure handling. Our approach includes pre-execution verification, which checks for potential failures before execution, and reactive failure handling, which detects and corrects failures during execution by verifying existing BT conditions, adding missing preconditions, and, when necessary, generating new skills. The framework uses a scene graph for structured environmental perception and an execution history for continuous monitoring, enabling context-aware and adaptive failure handling. We evaluate our framework through real-world experiments with an ABB YuMi robot on tasks such as peg insertion, object sorting, and drawer placement, as well as in the AI2-THOR simulator. Compared to using pre-execution and reactive methods separately, our approach achieves higher task success rates and greater adaptability. Ablation studies highlight the importance of VLM-based reasoning, structured scene representation, and execution history tracking for effective failure recovery in robotics.

Paper Structure

This paper contains 27 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of our approach, which consists of two phases: pre-execution verification and real-time monitoring. The pre-execution phase verifies the entire planned BT proactively using a VLM based on inputs (images, scene graphs, skills, and conditions). The real-time phase continuously monitors execution, where the VLM verifies preconditions, postconditions, and infers missing preconditions for individual skills using updated inputs and execution history. A reactive planner dynamically generates and adapts the BT as the robot’s execution policy.
  • Figure 2: Three failure instances with corresponding VLM responses. (a) Pre-execution verification detects that the black object blocks the hole, and the VLM suggests adding the missing precondition for the place skill. (b) Precondition verification identifies that the grasp skill fails due to an unmet condition, as the robot is already holding a red object. (c) Postcondition verification detects a failed placement since the blue object is on top of the green object instead of inside. Failure detection (red), identification (orange), and correction (blue) are indicated with corresponding VLM responses in black.
  • Figure 3: BT of the peg-in-hole task without failure handling.
  • Figure 4: Extended BT execution where a missing precondition is added, ensuring the gripper is empty before grasping the target object.
  • Figure 5: Three failure scenarios with corresponding VLM responses. (a) Precondition suggestor: The red object inside the green object leads the VLM to identify a missing precondition for the place skill. (b) Pre-execution missing skill generation: The VLM identifies the need for a push skill to remove the red object. (c) Real-time missing skill generation: The VLM suggests generating the push skill during execution. Failure detection (red), identification (orange), and correction (blue) phases are depicted, with VLM responses in black.
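
The BT extension described in the Figure 4 caption (adding a missing "gripper is empty" precondition before the grasp skill) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the node classes, the `add_precondition` helper, the `release` corrective skill, and the dictionary-based world state are all hypothetical names chosen for illustration, and the missing precondition is hard-coded here rather than suggested by a VLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, bool]  # hypothetical stand-in for the scene state


@dataclass
class Condition:
    name: str
    check: Callable[[State], bool]

    def tick(self, state: State) -> str:
        return "SUCCESS" if self.check(state) else "FAILURE"


@dataclass
class Action:
    name: str
    run: Callable[[State], bool]  # returns True if the skill succeeded

    def tick(self, state: State) -> str:
        return "SUCCESS" if self.run(state) else "FAILURE"


@dataclass
class Sequence:
    children: list

    def tick(self, state: State) -> str:
        # Succeeds only if every child succeeds, in order.
        for child in self.children:
            if child.tick(state) != "SUCCESS":
                return "FAILURE"
        return "SUCCESS"


@dataclass
class Fallback:
    children: list

    def tick(self, state: State) -> str:
        # Succeeds as soon as one child succeeds.
        for child in self.children:
            if child.tick(state) == "SUCCESS":
                return "SUCCESS"
        return "FAILURE"


def grasp(state: State) -> bool:
    # Toy model: grasping fails when the gripper already holds something.
    if not state["gripper_empty"]:
        return False
    state["holding_target"] = True
    state["gripper_empty"] = False
    return True


def release(state: State) -> bool:
    # Assumed corrective skill that empties the gripper.
    state["gripper_empty"] = True
    return True


def add_precondition(skill: Action, pre: Condition, fix: Action) -> Sequence:
    # Guard the skill: check the precondition, or run a skill that achieves it.
    return Sequence([Fallback([pre, fix]), skill])


holding_target = Condition("holding_target", lambda s: s["holding_target"])
gripper_empty = Condition("gripper_empty", lambda s: s["gripper_empty"])
grasp_action = Action("grasp", grasp)
release_action = Action("release", release)

# Tree without the precondition versus tree with the precondition added.
original_tree = Fallback([holding_target, grasp_action])
extended_tree = Fallback(
    [holding_target, add_precondition(grasp_action, gripper_empty, release_action)]
)
```

Ticking `original_tree` from a state where the gripper is already occupied fails, because nothing in the tree frees the gripper. Ticking `extended_tree` from the same state first runs the assumed `release` skill and then grasps the target. In the framework itself, the condition to add is inferred by the VLM and the tree is expanded by the reactive planner; the sketch only shows the resulting tree structure.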