ConditionNET: Learning Preconditions and Effects for Execution Monitoring

Daniel Sliwowski, Dongheui Lee

TL;DR

ConditionNET addresses the challenge of robust execution monitoring in unstructured robotic environments by learning action preconditions and effects from data using a compact vision-language transformer. The method explicitly conditions on the performed action and optimizes for consistent representations between preconditions, effects, and actions via an InfoNCE-based consistency loss, enabling real-time anomaly detection and recovery. Across two datasets, including a new Panda teleoperation collection, ConditionNET consistently outperforms baselines in anomaly detection and phase prediction, with a modest 30M-parameter footprint and fast inference ($21\pm 18$ ms per 1000 batches). The work demonstrates practical applicability through a real-robot monitoring system and highlights future opportunities in domain transfer and explainability, supported by publicly available data.

Abstract

The introduction of robots into everyday scenarios necessitates algorithms capable of monitoring the execution of tasks. In this paper, we propose ConditionNET, an approach for learning the preconditions and effects of actions in a fully data-driven manner. We develop an efficient vision-language model and introduce additional optimization objectives during training to optimize for consistent feature representations. ConditionNET explicitly models the dependencies between actions, preconditions, and effects, leading to improved performance. We evaluate our model on two robotic datasets, one of which we collected for this paper, containing 406 successful and 138 failed teleoperated demonstrations of a Franka Emika Panda robot performing tasks like pouring and cleaning the counter. We show in our experiments that ConditionNET outperforms all baselines on both anomaly detection and phase prediction tasks. Furthermore, we implement an action monitoring system on a real robot to demonstrate the practical applicability of the learned preconditions and effects. Our results highlight the potential of ConditionNET for enhancing the reliability and adaptability of robots in real-world environments. The data is available on the project website: https://dsliwowski1.github.io/ConditionNET_page.

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An overview of the proposed anomaly detection and recovery algorithm. ConditionNET detects anomalies by comparing the expected and current motion phases; the current phase is predicted by a vision-language model. A behavior tree governs task execution, providing the action and expected phases, while motions are generated by a skill library and executed on the robot with an impedance controller.
  • Figure 2: ConditionNET architecture. For an image-action pair, we compute the condition feature $E$ and classify the current observation as precondition, effect, or unsatisfied. We extract image and semantic features using DINOv2 and CLIP. The State Transformer then extracts the general state feature $\widehat{cls}$, and the Condition Transformer extracts the condition feature $E$. For the consistency loss, we use features from both the precondition frame and the effect frame, denoted $-$ and $+$, respectively. We compute the action feature $e_a$ as the difference between $\widehat{cls^+}$ and $\widehat{cls^-}$. Using the InfoNCE loss (van den Oord et al., 2018), we make the action feature "similar" to the paraphrased action description $s_p$, but only for successfully executed actions.
  • Figure 3: Simplified depiction of the behavior tree. Octagons denote selector nodes, rectangles denote sequence nodes, and rounded rectangles denote behaviors.
  • Figure 4: Qualitative results showing model performance in the continuous action monitoring experiment. For clarity, only the results for single actions are shown; in practice, predictions for all actions are made in parallel. Less visible objects are highlighted with red boxes.
  • Figure 5: Phase prediction confidences and anomaly prediction results over time. Each hue marks a different expected action, and each saturation denotes a different expected motion phase. The most saturated color represents the pre-motion, the medium saturation denotes the core-motion, and the least saturated indicates the post-motion. The images below show snapshots for different points in the execution.
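
The consistency objective described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the temperature value, the use of other descriptions in the same batch as negatives, and the toy 2-D vectors standing in for $\widehat{cls^-}$, $\widehat{cls^+}$, and the embedded description $s_p$ are all assumptions made for illustration. It only shows the general shape of an InfoNCE loss applied to $e_a = \widehat{cls^+} - \widehat{cls^-}$ and masked to successful executions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale to unit length so the dot product below is a cosine similarity.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def action_feature(cls_pre, cls_post):
    # e_a = cls^+ - cls^-: effect-frame state feature minus precondition-frame one.
    return [p - m for p, m in zip(cls_post, cls_pre)]

def info_nce(action_feats, text_feats, success, temperature=0.1):
    """Mean InfoNCE loss over successful samples only (assumed masking scheme).

    action_feats[i] is pulled toward text_feats[i]; the remaining rows of
    text_feats act as negatives (an assumption, not taken from the paper).
    """
    a = [normalize(v) for v in action_feats]
    t = [normalize(v) for v in text_feats]
    losses = []
    for i, ok in enumerate(success):
        if not ok:
            continue  # failed executions contribute nothing to this loss
        logits = [dot(a[i], tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch of two demonstrations with hypothetical 2-D features.
cls_pre = [[0.0, 0.0], [1.0, 1.0]]
cls_post = [[1.0, 0.0], [1.0, 2.0]]
e_a = [action_feature(m, p) for m, p in zip(cls_pre, cls_post)]

aligned = info_nce(e_a, [[1.0, 0.0], [0.0, 1.0]], [True, True])
swapped = info_nce(e_a, [[0.0, 1.0], [1.0, 0.0]], [True, True])
print(aligned < swapped)
```

In this toy batch the loss is small when each action feature points in the same direction as its own description and large when the descriptions are swapped, which is the behavior a consistency loss of this kind is meant to encourage.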