PREGO: online mistake detection in PRocedural EGOcentric videos

Alessandro Flaborea, Guido Maria D'Amely di Melendugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, Giovanni Maria Farinella, Fabio Galasso

TL;DR

This work addresses online open-set procedural mistake detection in egocentric videos. It presents PREGO, a dual-branch system that combines online step recognition with symbolic next-step anticipation performed by a Large Language Model, and flags a mistake when the recognized current action diverges from the anticipated one. To evaluate online open-set performance, the authors introduce the Assembly101-O and Epic-tent-O benchmarks, derived from Assembly101 and Epic-tent, and report precision, recall, and F1. Experimental results show that PREGO, especially with Llama-based symbolic reasoning, outperforms baselines and demonstrates practical potential for real-time monitoring in industries like manufacturing and healthcare.

Abstract

Promptly identifying procedural errors from egocentric videos in an online setting is highly challenging and valuable for detecting mistakes as soon as they happen. This capability has a wide range of applications across various fields, such as manufacturing and healthcare. The nature of procedural mistakes is open-set since novel types of failures might occur, which calls for one-class classifiers trained on correctly executed procedures. However, no technique can currently detect open-set procedural mistakes online. We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos. PREGO is based on an online action recognition component to model the current action, and a symbolic reasoning module to predict the next actions. Mistake detection is performed by comparing the recognized current action with the expected future one. We evaluate PREGO on two procedural egocentric video datasets, Assembly101 and Epic-tent, which we adapt for online benchmarking of procedural mistake detection, thus defining the Assembly101-O and Epic-tent-O datasets, respectively.
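
The comparison step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `detect_mistakes` and `precision_recall_f1`, the representation of actions as plain string labels, and the convention that "mistake" is the positive class are all choices made for this example.

```python
from typing import List, Sequence, Tuple


def detect_mistakes(recognized: Sequence[str], anticipated: Sequence[str]) -> List[bool]:
    """Flag a step as a mistake when the recognized action differs from
    the anticipated one (hypothetical helper, one label per step)."""
    if len(recognized) != len(anticipated):
        raise ValueError("both sequences must have one label per step")
    return [r != a for r, a in zip(recognized, anticipated)]


def precision_recall_f1(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with 'mistake' as the positive class
    (an assumption of this sketch)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For instance, with the made-up labels `detect_mistakes(["a", "b", "c"], ["a", "c", "c"])`, only the second step is flagged, because it is the only step where the two branches disagree.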

Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: PREGO is based on two main components: The recognition module (top) processes the input video in an online fashion and predicts actions observed at each timestep; the anticipation module (bottom) reasons symbolically via a Large Language Model to predict the future action based on past action history and a brief context, such as instances of other action sequences. Mistakes are identified when the current action detected by the step recognition method differs from the one forecasted by the step anticipation module (right).
  • Figure 2: Two different representations of the actions in the prompt for the LLM. On the left, the prompt is represented using symbolic labels. On the right, the prompt contains the names of the actions in the transcript. The context part of the prompt is fixed and retrieved from the dataset, while the recognition module extracts the current sequence.
  • Figure 3: Three prompt variants, defining different inputs to the LLMs. On the left, the prompt lacks any reference to sequences or symbols to be completed. In the center, the prompt consists of more detailed and lengthier requests. On the right, the prompt incorporates the context of the sequence explicitly. This third variant performs best and is therefore adopted in PREGO.
  • Figure 4: Epic-tent-O split between train and test set based on the self-confidence of actors while performing the procedure. Videos with IDs in $[1,7]$ have no confidence score annotations and are included in the test set.
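
The symbolic prompt described for Figure 2 (a fixed context of example sequences followed by the current, partially observed sequence) might be assembled roughly as below. This is only a sketch: the prompt wording, the `Context:` / `Sequence:` layout, the integer symbols, the action names, and the helper names `build_symbol_table` and `build_prompt` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def build_symbol_table(sequences: Sequence[Sequence[str]]) -> Dict[str, str]:
    """Map each distinct action name to a symbol, numbered in order of
    first appearance (the numbering scheme is an assumption)."""
    table: Dict[str, str] = {}
    for seq in sequences:
        for action in seq:
            if action not in table:
                table[action] = str(len(table))
    return table


def build_prompt(
    context: Sequence[Sequence[str]],
    current: Sequence[str],
    table: Dict[str, str],
) -> str:
    """Render the context sequences and the current partial sequence as
    symbols; the trailing comma invites the model to continue the sequence.
    Raises KeyError if an action is missing from the symbol table."""
    lines: List[str] = ["Context:"]
    for seq in context:
        lines.append(", ".join(table[a] for a in seq))
    lines.append("Sequence:")
    lines.append(", ".join(table[a] for a in current) + ",")
    return "\n".join(lines)


# Usage with made-up action names.
context = [["attach wheel", "screw cabin"], ["attach wheel", "attach arm"]]
table = build_symbol_table(context)
prompt = build_prompt(context, ["attach wheel"], table)
```

The symbol predicted by the model would then be mapped back through the same table and compared with the action produced by the recognition branch.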