Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso
TL;DR
Mistake Attribution (MATT) targets fine-grained mistake understanding in egocentric video by defining semantic, temporal, and spatial attributions between the instruction text and the user's attempt. A data engine, MisEngine, automatically constructs large-scale, attribution-rich datasets (Ego4D-M and EPIC-KITCHENS-M) through cross-matching guided by semantic role labeling (SRL) and by inheriting annotations from the source datasets, which supplies supervision for all three attribution types. MisFormer, a unified transformer-based model, jointly identifies the semantic misalignment, localizes the Point-of-No-Return (PNR), and grounds the corresponding mistake region, outperforming specialized baselines on the semantic, temporal, and spatial attribution tasks as well as on mistake detection. The work contributes scalable data and a single architecture for mistake attribution, with practical implications for feedback and self-improvement in real-world tasks.
Abstract
We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
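To make the three attribution outputs concrete, the following is a minimal sketch of how a single MATT prediction could be represented. It is an illustration under stated assumptions, not the paper's data format: the class name `MistakeAttribution`, the helper `violated_roles`, the role labels `predicate` and `object`, and the frame index and box coordinates are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical box type for illustration: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class MistakeAttribution:
    """One attribution record; field names are assumptions, not from the paper."""
    violated_roles: List[str]   # "what": instruction roles the attempt violates
    pnr_frame: Optional[int]    # "when": index of the Point-of-No-Return frame
    region: Optional[BBox]      # "where": mistake region in the PNR frame

    @property
    def is_mistake(self) -> bool:
        # An attempt counts as a mistake if at least one role is violated.
        return bool(self.violated_roles)


def violated_roles(instruction: Dict[str, str], attempt: Dict[str, str]) -> List[str]:
    """Return the instruction roles whose fillers differ in the attempt."""
    return [role for role, filler in instruction.items() if attempt.get(role) != filler]


# Toy example: the instruction says "cut the onion", the attempt cuts a carrot.
instruction = {"predicate": "cut", "object": "onion"}
attempt = {"predicate": "cut", "object": "carrot"}

result = MistakeAttribution(
    violated_roles=violated_roles(instruction, attempt),
    pnr_frame=42,                        # placeholder frame index
    region=(120.0, 80.0, 260.0, 210.0),  # placeholder box
)
print(result.violated_roles)  # ['object']
```

In this toy case the semantic attribution is the `object` role, while the PNR frame and region would come from a model or from inherited annotations rather than being hard-coded.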
