TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions

Nerea Gallego, Fernando Salanova, Claudio Mannarano, Cristian Mahulea, Eduardo Montijano

TL;DR

A new VAD-inspired architecture, TIMID, which is able to detect robot time-dependent mistakes when executing high-level tasks and can be trained with weak supervision, requiring only a single label per video.

Abstract

As robotic systems execute increasingly difficult task sequences, the number of ways in which they can fail also grows. Video Anomaly Detection (VAD) frameworks typically focus on singular, low-level kinematic or action failures and struggle to identify more complex temporal or spatial task violations, because these violations do not necessarily manifest as low-level execution errors. To address this problem, the main contribution of this paper is a new VAD-inspired architecture, TIMID, which detects time-dependent robot mistakes during the execution of high-level tasks. Our architecture receives as input a video and prompts describing the task and the potential mistake, and returns a frame-level prediction of whether the mistake is present. By adopting a VAD formulation, the model can be trained with weak supervision, requiring only a single label per video. Additionally, to alleviate the scarcity of data on incorrect executions, we introduce a multi-robot simulation dataset with controlled temporal errors, together with real executions for zero-shot sim-to-real evaluation. Our experiments demonstrate that out-of-the-box VLMs lack the explicit temporal reasoning required for this task, whereas our framework successfully detects different types of temporal errors. Project: https://ropertunizar.github.io/TIMID/
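The weak-supervision idea, frame-level scores trained from a single label per video, can be illustrated with a minimal sketch. It assumes a multiple-instance-learning objective with top-k pooling, a common choice in weakly supervised VAD. The abstract does not state that TIMID uses this exact loss, and the names `topk_video_score` and `video_level_bce`, the value `k=3`, and the example scores are all hypothetical.

```python
import math


def topk_video_score(frame_scores, k=3):
    """Pool per-frame mistake scores in [0, 1] into one video-level score.

    Assumption (not from the paper): mean of the k highest frame scores.
    """
    if not frame_scores:
        raise ValueError("frame_scores must be non-empty")
    k = min(k, len(frame_scores))
    top = sorted(frame_scores, reverse=True)[:k]
    return sum(top) / k


def video_level_bce(frame_scores, video_label, k=3, eps=1e-7):
    """Binary cross-entropy between the pooled score and the video label.

    Only the single video-level label (0 = correct, 1 = mistake) is used,
    so no frame-level annotation is needed during training.
    """
    p = topk_video_score(frame_scores, k)
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))


# Hypothetical per-frame scores produced by some model for one video.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
pooled = topk_video_score(scores, k=3)
loss_if_mistake = video_level_bce(scores, video_label=1)
loss_if_correct = video_level_bce(scores, video_label=0)
```

With these scores, the loss is lower when the video is labelled as containing a mistake, because the highest-scoring frames already agree with that label. In an actual training loop the same pooling would be applied to differentiable model outputs rather than plain Python lists.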

Paper Structure (31 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: TIMID: proposed architecture to identify time-dependent mistakes in videos of robot executions. Our model takes a video and two prompts as input and outputs a mistake prediction at the frame level. The Video Anomaly Detection-inspired architecture allows for weakly supervised training using only video-level annotations.
  • Figure 2: Overview of the proposed time-dependent mistake detection pipeline. The system processes video streams, tasks, and mistake descriptions to identify semantic and temporal deviations from high-level task objectives.
  • Figure 3: Description of the tasks contained in the dataset. The top half shows a mutual exclusion task, focused on concurrency. The lower half shows an ordering task, with emphasis on time. The dataset includes frame- and video-level annotations.
  • Figure 4: Example frames from the multi-robot dataset, captured from multiple points of view and locations.
  • Figure 5: Example predictions of the models across the different benchmarks. Frames exemplifying each execution are shown on top. Below the frames, the predictions of our model (in green) and of the baselines (Qwen in blue and PEL4VAD in yellow) are shown against the ground truth (in red). Top left: a synthetic execution of the ordering case. Top right: a real video of the proximity case. Bottom left: examples from the bridge dataset. Bottom right: incorrect executions with examples of false positives and false negatives.
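Comparing frame-level predictions against the ground truth, as in Figure 5, amounts to counting per-frame agreements and disagreements. The following is a minimal sketch under stated assumptions: the function name `frame_confusion`, the 0.5 threshold, and the example values are hypothetical, and the snippet does not reproduce the paper's evaluation protocol.

```python
def frame_confusion(pred_scores, ground_truth, threshold=0.5):
    """Count frame-level true/false positives and negatives.

    pred_scores: per-frame mistake scores in [0, 1].
    ground_truth: per-frame labels (1 = mistake present, 0 = absent).
    threshold: hypothetical cut-off for turning scores into decisions.
    """
    if len(pred_scores) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, label in zip(pred_scores, ground_truth):
        predicted = score >= threshold
        if predicted and label == 1:
            counts["tp"] += 1
        elif predicted and label == 0:
            counts["fp"] += 1
        elif not predicted and label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical five-frame video: the mistake is present in frames 2 and 3.
counts = frame_confusion([0.1, 0.7, 0.9, 0.4, 0.6], [0, 0, 1, 1, 0])
```

Here frames 1 and 4 are false positives and frame 3 is a false negative, the two kinds of error highlighted in the bottom-right panel of Figure 5.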