TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari

TL;DR

TI-PREGO addresses online open-set procedural mistake detection in egocentric videos by fusing real-time action recognition with LLM-based anticipation. The dual-branch design uses an online recognizer ρ and a symbolic anticipator ξ driven by Automatic Chain-of-Thought to detect mistakes as mismatches, aided by frame-aggregation strategies. The authors introduce Assembly101-O and Epic-tent-O as online benchmarks and demonstrate state-of-the-art performance with TI-PREGO, including prompting variants and ablations. The work enables timely, robust feedback for skill training and industrial tasks, highlighting the potential of integrating vision, symbolic reasoning, and LLMs for online procedural monitoring.

Abstract

Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual-branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch leverages the strong pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method demonstrates its robustness and effectiveness in online applications.
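
The mismatch rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the anticipator below is a toy prefix lookup over example correct sequences standing in for the LLM, and all function names, the step labels, and the choice to skip steps for which the anticipator has no prediction are assumptions made for illustration.

```python
from itertools import groupby
from typing import Callable, List, Optional, Sequence


def to_action_tokens(frame_labels: Sequence[str]) -> List[str]:
    """Collapse runs of identical frame-level labels into action tokens."""
    return [label for label, _ in groupby(frame_labels)]


def toy_anticipator(
    examples: Sequence[Sequence[str]],
) -> Callable[[Sequence[str]], Optional[str]]:
    """Hypothetical stand-in for the LLM branch: predict the next step by
    matching the observed history against prefixes of correct examples."""

    def predict(history: Sequence[str]) -> Optional[str]:
        n = len(history)
        for seq in examples:
            if len(seq) > n and list(seq[:n]) == list(history):
                return seq[n]
        return None  # no example shares this prefix

    return predict


def first_mistake(
    frame_labels: Sequence[str],
    anticipate: Callable[[Sequence[str]], Optional[str]],
) -> Optional[int]:
    """Return the index of the first action token that differs from the
    anticipated one, or None if no mismatch is found."""
    tokens = to_action_tokens(frame_labels)
    for i in range(1, len(tokens)):
        expected = anticipate(tokens[:i])
        # Assumption: a missing prediction is skipped rather than flagged.
        if expected is not None and tokens[i] != expected:
            return i
    return None


examples = [["attach wheel", "attach cabin", "screw roof"]]
anticipate = toy_anticipator(examples)
wrong_order = ["attach wheel"] * 3 + ["screw roof"] * 2
right_order = ["attach wheel"] * 2 + ["attach cabin"] + ["screw roof"]
print(first_mistake(wrong_order, anticipate))
print(first_mistake(right_order, anticipate))
```

In this sketch a mistake is reported at the token level; mapping the token index back to a frame index would be needed for the per-frame evaluation the abstract mentions.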

Paper Structure

This paper contains 35 sections, 6 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Procedural Mistake Detection involves identifying errors within a procedural video. Each procedure is composed of different steps that should be executed in a certain order. The aim is to develop a method capable of analyzing a video and determining whether each frame contains a mistake, e.g., a step that is not executed in the correct order. Addressing this task is particularly useful when applied to wearable devices, as they allow for direct feedback to the person performing the task. In the figure, TI-PREGO, depicted by a gear icon, takes as input the video sequence from time $0$ to $t-1$, classifying the current frame $t$ as either correct (green) or a mistake (red). This process continues frame by frame until a mistake is detected.
  • Figure 2: Our proposed model is based on two main components. The recognition module (orange) processes the input video in an online fashion and predicts actions observed at each timestep represented as a square. This prediction is combined with previous ones to minimize the noise generated by per-frame predictions. The last element of the aggregated output corresponds to the current step predicted by the recognition module, while the sequence without duplicates is provided to the anticipation module. The anticipation module (blue) reasons symbolically via a Large Language Model, utilizing automatic Chain of Thought (ACoT) reasoning to predict the future action based on past action history and a brief context such as instances of other step sequences. Mistakes are identified when the current action detected by the step recognition method differs from the one forecasted by the step anticipation module.
  • Figure 3: Distribution of the 20 most frequent classes in Assembly101-O
  • Figure 4: The three aggregation strategies operate differently: NOMA employs a non-overlapping window and replaces the entire window content with its mode. In contrast, both OMA and OCMA use overlapping windows with a sliding step of 1. However, OMA replaces the last frame of the window with its mode, whereas OCMA replaces the central frame.
  • Figure 5: The modification occurs in two steps: starting from Assembly101 (a), first, a new train/test split assigns all correct procedures to the training set, reserving videos with mistakes for testing (b). Second, the lengths of procedures containing mistakes are adjusted by trimming them to the first mistake, preventing the creation of corrupted sequences (c). This setup enables models to learn correct sequences within a one-class classification (OCC) framework, treating any deviation as a mistake and ensuring a balanced test set for effective mistake detection.
  • ...and 4 more figures
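
The three aggregation strategies named in the Figure 4 caption (NOMA, OMA, OCMA) can be sketched as follows. This is a minimal illustration rather than the authors' code: the window size `w`, the truncation of windows at sequence boundaries, the tie-breaking rule in `mode`, and the choice to compute every mode from the raw predictions are all assumptions that the caption does not specify.

```python
from collections import Counter
from typing import List, Sequence


def mode(window: Sequence[str]) -> str:
    """Most frequent label; ties go to the label seen first in the window
    (an assumed tie-breaking rule)."""
    counts = Counter(window)
    best = max(counts.values())
    for label in window:
        if counts[label] == best:
            return label
    raise ValueError("empty window")


def noma(preds: Sequence[str], w: int) -> List[str]:
    """Non-overlapping windows: every frame in a window becomes its mode."""
    out: List[str] = []
    for start in range(0, len(preds), w):
        chunk = preds[start:start + w]
        out.extend([mode(chunk)] * len(chunk))
    return out


def oma(preds: Sequence[str], w: int) -> List[str]:
    """Overlapping windows, step 1: the last frame of each window is
    replaced by the window mode (uses past frames only)."""
    return [mode(preds[max(0, t - w + 1):t + 1]) for t in range(len(preds))]


def ocma(preds: Sequence[str], w: int) -> List[str]:
    """Overlapping windows, step 1: the central frame of each window is
    replaced by the window mode (needs w // 2 frames of look-ahead)."""
    half = w // 2
    return [mode(preds[max(0, t - half):t + half + 1]) for t in range(len(preds))]


raw = ["a", "a", "b", "a", "a", "c", "c", "c", "c"]  # "b" is a noisy frame
print(noma(raw, 3))
print(oma(raw, 3))
print(ocma(raw, 3))
```

On this toy sequence all three variants suppress the isolated noisy frame. OMA relies only on past frames, so it switches to the new label one frame late, whereas OCMA tracks the change exactly at the cost of look-ahead.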