Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang

TL;DR

CaM presents a unified framework that uses constraint-aware visual programming to enable open-set reactive and proactive failure detection in robotics. By extracting constraint elements via ConSeg, translating constraints into executable monitor code with GPT-4o, and tracking elements in real time, CaM achieves high-precision, low-latency monitoring across three simulators and a real-world setting. Under severe disturbances, it achieves a 28.7% higher success rate and a 31.8% shorter execution time than baselines, and it can be combined with open-loop policies to form closed-loop systems for long-horizon tasks in cluttered, dynamic environments. The combination of constraint-aware segmentation, multi-view observation, and code-based monitoring provides a principled path toward generalizable, real-time failure detection in diverse robotic manipulation tasks.

Abstract

Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.
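To make the idea of evaluating constraints with code concrete, here is a minimal Python sketch; it is not the code that CaM's VLM generates. It assumes the constraint elements have already been tracked into 3D points and a surface normal. The function names, the thresholds (`max_dist`, `max_tilt_deg`), and the choice of +z as the world up-axis are all illustrative assumptions. One check is reactive (an object has ended up too far from its container) and one is proactive (a surface is tilting before anything has spilled).

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)

def tilt_deg(normal, up=(0.0, 0.0, 1.0)):
    """Angle in degrees between a surface normal and the assumed up-axis."""
    dot = sum(a * b for a, b in zip(normal, up))
    norm = math.hypot(*normal) * math.hypot(*up)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def monitor(obj_point, container_point, container_normal,
            max_dist=0.10, max_tilt_deg=15.0):
    """Check both constraints for one time step.

    Returns (ok, reason). Thresholds are hypothetical values in metres
    and degrees, chosen only for illustration.
    """
    d = distance(obj_point, container_point)
    if d > max_dist:
        return False, f"reactive: object is {d:.2f} m from container"
    t = tilt_deg(container_normal)
    if t > max_tilt_deg:
        return False, f"proactive: container tilted {t:.1f} deg"
    return True, "all constraints satisfied"
```

Calling `monitor` once per tracked frame turns the two geometric checks into a spatio-temporal constraint: the task is considered healthy only while every frame returns `ok`.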

Paper Structure

This paper contains 48 sections, 2 equations, 24 figures, 11 tables.

Figures (24)

  • Figure 1: For the task "Move the pan with lobster to the stove without losing the lobster", (a) reactive failure detection identifies failures after they occur, and (b) proactive failure detection prevents foreseeable failures. In (a), at $t^R_4$, the robot detects the failure after the lobster unpredictably jumps out due to the heat. In (b), pan tilting is detected at $t^P_3$ and corrected at $t^{P'}_3$, which requires real-time precision. We formulate both tasks as spatio-temporal constraint satisfaction problems, leveraging our proposed constraint elements for precise, real-time checking. For example, in (a), a large relative distance between pan and lobster indicates failure; in (b), a large angle between the pan and the horizontal plane needs correction. (c) shows that our method combined with an open-loop policy forms a closed-loop system, enabling proactive (e.g., detecting moving glass during grasping) and reactive (e.g., removing toy after grasping) failure detection in cluttered scenes.
  • Figure 2: Overview of Code-as-Monitor. Given task instructions and prior information, the Constraint Generator derives the next subgoal and corresponding textual constraints based on multi-view observations. The Painter maps these constraints onto images as constraint elements. The Monitor generates monitor code from these images and tracks them for real-time monitoring. If any constraint is violated, it outputs the reason for failure and triggers re-planning. This framework unifies reactive and proactive failure detection via constraints, more generally abstracts relevant entities/parts through constraint elements, and ensures precise and real-time monitoring via code evaluation.
  • Figure 3: Constraint Element Pipeline. Given a constraint, our model ConSeg generates instance-level and part-level masks across multiple views, which are projected into 3D space. Through a series of heuristics, the desired elements are produced. Once all elements are obtained, they are annotated onto the original multi-view images. Here we display the annotation result of one element.
  • Figure 4: ConSeg architecture. Here we display the part-level segmentation, which will output the desired element type and mask.
  • Figure 5: Example of Real-world Evaluation. The red bounding box shows the current grasp target, which may shift due to environmental changes. CaM monitors and adapts to these changes in real-time, resulting in a closed-loop system with an open-loop policy.
  • ...and 19 more figures
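
The loop described in the captions of Figures 2 and 5 (monitor, report the reason for failure, trigger re-planning around an open-loop policy) can be sketched as follows. This is a hypothetical outline, not the paper's implementation: `plan`, `observe`, and the monitor callables are placeholder names supplied by the caller, the re-plan limit is an assumption, and checking once per action stands in for continuous real-time monitoring.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Monitor = Callable[[dict], Tuple[bool, str]]
Action = Callable[[], None]

def run_closed_loop(
    plan: Callable[[Optional[str]], Iterable[Action]],
    observe: Callable[[], dict],
    monitors: List[Monitor],
    max_replans: int = 3,
) -> Tuple[bool, List[str]]:
    """Wrap an open-loop planner with constraint monitors.

    plan(reason) returns a sequence of actions; reason is None on the
    first call and the last failure reason on every re-plan.
    Returns (success, list of failure reasons seen).
    """
    failures: List[str] = []
    reason: Optional[str] = None
    for _ in range(max_replans + 1):
        violated = False
        for action in plan(reason):
            action()                      # execute one open-loop step
            state = observe()             # tracked constraint elements
            for check in monitors:
                ok, why = check(state)
                if not ok:
                    failures.append(why)  # keep the reason for re-planning
                    reason = why
                    violated = True
                    break
            if violated:
                break                     # abandon this plan and re-plan
        if not violated:
            return True, failures
    return False, failures
```

Passing the failure reason back into `plan` is the design choice that matters here: the planner itself stays open-loop, and the monitors alone close the loop.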