TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

Leonardo Plini; Luca Scofano; Edoardo De Matteis; Guido Maria D'Amely di Melendugno; Alessandro Flaborea; Andrea Sanchietti; Giovanni Maria Farinella; Fabio Galasso; Antonino Furnari

TL;DR

TI-PREGO addresses online, open-set procedural mistake detection in egocentric videos by combining real-time action recognition with LLM-based anticipation. Its dual-branch design pairs an online recognizer ρ with a symbolic anticipator ξ driven by Automatic Chain-of-Thought prompting, and flags a mistake whenever the recognized step does not match the anticipated one; frame-aggregation strategies turn noisy per-frame predictions into action tokens. The authors introduce Assembly101-O and Epic-tent-O as online benchmarks and report state-of-the-art performance for TI-PREGO, alongside comparisons of prompting variants and ablations. The work targets timely, robust feedback for skill training and industrial tasks, and highlights the potential of integrating vision, symbolic reasoning, and LLMs for online procedural monitoring.

Abstract

Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. Such mistakes are inherently open-set: unforeseen or novel errors may occur, so robust detection systems cannot rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual-branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch leverages the pattern-matching capabilities of Large Language Models (LLMs) to predict the next action token from the previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluation, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of a dual-branch architecture for mistake detection and show the effectiveness of our proposed approach. In a thorough evaluation that includes recognition and anticipation variants and state-of-the-art models, our method proves robust and effective in online applications.
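The mismatch rule described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `aggregate_frames` uses a simple run-length filter (`min_run`) as one possible frame-aggregation strategy, and `make_toy_anticipator` stands in for the LLM anticipator with a plain prefix lookup over example procedures. For brevity, `monitor` processes a recorded stream in one pass, whereas a real online system would update its tokens incrementally.

```python
from typing import Callable, List, Optional, Sequence, Set

# Hypothetical type: maps the action tokens seen so far to the set of plausible next tokens.
Anticipator = Callable[[List[str]], Set[str]]


def aggregate_frames(frame_labels: Sequence[str], min_run: int = 3) -> List[str]:
    """Collapse per-frame labels into action tokens.

    Assumption: runs shorter than `min_run` frames are treated as recognition
    noise and dropped; consecutive duplicate tokens are merged.
    """
    tokens: List[str] = []
    run_label: Optional[str] = None
    run_len = 0
    # The trailing None is a sentinel that flushes the final run.
    for label in list(frame_labels) + [None]:
        if label == run_label:
            run_len += 1
            continue
        if run_label is not None and run_len >= min_run:
            if not tokens or tokens[-1] != run_label:
                tokens.append(run_label)
        run_label, run_len = label, 1
    return tokens


def make_toy_anticipator(examples: Sequence[Sequence[str]]) -> Anticipator:
    """Stand-in for the LLM branch: look up the next step in example procedures
    whose prefix matches the history exactly."""
    def anticipate(history: List[str]) -> Set[str]:
        n = len(history)
        return {
            ex[n]
            for ex in examples
            if len(ex) > n and list(ex[:n]) == list(history)
        }
    return anticipate


def monitor(frame_labels: Sequence[str], anticipate: Anticipator,
            min_run: int = 3) -> Optional[int]:
    """Return the token index of the first mistake, or None if none is found.

    A mistake is a recognized token that the anticipator did not expect
    given the preceding tokens.
    """
    tokens = aggregate_frames(frame_labels, min_run)
    for i, token in enumerate(tokens):
        if token not in anticipate(tokens[:i]):
            return i
    return None
```

With a single example procedure `["attach wheel", "attach cabin", "attach roof"]`, a stream that goes from "attach wheel" straight to "attach roof" is flagged at the second token, because the anticipator expected "attach cabin". No examples of mistakes are needed, which is what makes the rule open-set.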
