Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos

Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso

TL;DR

MATT addresses the need for fine-grained mistake understanding in egocentric video by defining semantic, temporal, and spatial attributions between instructional text and user action. A data engine, MisEngine, automatically constructs large-scale, attribution-rich datasets (Ego4D-M and EPIC-KITCHENS-M) through cross-matching guided by semantic role labeling (SRL) and inheritance of existing annotations, which supplies supervision for all three attribution types. MisFormer, a unified transformer-based model, jointly learns semantic misalignment, localizes the Point-of-No-Return (PNR), and grounds the corresponding mistake region, outperforming specialized baselines on the semantic, temporal, and spatial tasks as well as on mistake detection. The work advances physically grounded instructional AI by providing scalable data and a single architecture capable of comprehensive mistake attribution, with practical implications for feedback and self-improvement in real-world tasks.

Abstract

We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
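
As a reading aid, the three attribution outputs described above (what, when, where) can be pictured as a single record. The sketch below is only an illustration under stated assumptions: the class name `MistakeAttribution`, its field names, and the box coordinates are hypothetical and do not come from the paper; the role and frame values echo the example in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MistakeAttribution:
    """Hypothetical container for one MATT prediction (names are assumptions)."""

    # What: semantic roles of the instruction that the attempt violates.
    mistaken_roles: List[str]
    # When: index of the Point-of-No-Return (PNR) frame, if there is a mistake.
    pnr_frame: Optional[int]
    # Where: (x_min, y_min, x_max, y_max) box in the PNR frame, if any.
    box: Optional[Tuple[float, float, float, float]]

    @property
    def is_mistake(self) -> bool:
        # An attempt counts as a mistake when at least one role is violated.
        return bool(self.mistaken_roles)


# Example mirroring Figure 1: a wrong Object, PNR at frame 17.
# The box coordinates are made up purely for illustration.
result = MistakeAttribution(
    mistaken_roles=["Object"],
    pnr_frame=17,
    box=(0.40, 0.35, 0.62, 0.58),
)
correct = MistakeAttribution(mistaken_roles=[], pnr_frame=None, box=None)
```

In this framing, binary mistake detection falls out of the semantic output: an attempt with no violated role is treated as correct.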

Paper Structure

This paper contains 27 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The Mistake Attribution (MATT) task aims to understand the deviation between a human attempt (video) and the instruction (text) along three axes. Semantic attribution identifies what semantic role in the instruction is violated (e.g., a wrong Object, "bolt", is picked up instead of "hammer"); temporal attribution identifies when the attempt reaches the point of no return (PNR) (e.g., Frame 17); and spatial attribution identifies where, in the PNR frame, the mistake is manifested (e.g., the red bounding box).
  • Figure 2: A significant challenge in mistake video analysis is the paucity of available data in the face of the massive diversity of possible mistakes. This figure explains our new data engine, MisEngine, which overcomes this challenge by automatically creating new mistake understanding datasets from source corpora through a careful series of sampling and cross-matching methods. MisEngine applies semantic role labeling to the text instruction and then matches across the available roles; here, we show an example with two roles (object as "Obj" and predicate as "V"). Our resulting datasets are orders of magnitude larger than existing ones and fully annotated (for free) across mistake detection and attribution.
  • Figure 3: MisFormer’s unified architecture for mistake attribution. MisFormer jointly processes the instruction text and an attempt video, extracting shared multimodal features. Three specialized transformer heads perform semantic attribution (detecting misaligned roles), temporal localization (pinpointing the Point-of-No-Return frame), and spatial localization (predicting mistake regions via attention-driven bounding boxes), enabling comprehensive and interpretable mistake analysis across semantic, temporal, and spatial dimensions.
  • Figure 4: Qualitative results of MisFormer. Red text highlights mistaken semantic roles, the red frame marks the Point-of-No-Return (PNR), and the red bounding box localizes the mistake region in the PNR frame. The column in each sample visualizes the per-frame heatmap from the spatial attribution module for reference.
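
The cross-matching idea in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not MisEngine itself: it assumes the semantic roles of each source clip are already extracted, it pairs every clip's video with every other clip's instruction, and it labels the roles whose values differ. The names `Clip` and `cross_match`, the toy clips, and the exhaustive pairing are assumptions; the paper's actual sampling strategy is not described in this summary.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List


@dataclass(frozen=True)
class Clip:
    """A source clip with pre-extracted semantic roles (hypothetical schema)."""

    clip_id: str
    roles: Dict[str, str]  # e.g. {"V": "pick up", "Obj": "hammer"}


def cross_match(clips: List[Clip]) -> List[dict]:
    """Pair one clip's video with another clip's instruction.

    Roles present in both clips but with different values are recorded as
    the mistaken roles of the resulting sample.
    """
    samples = []
    for attempt, instruction in permutations(clips, 2):
        shared = attempt.roles.keys() & instruction.roles.keys()
        mismatched = sorted(
            role for role in shared
            if attempt.roles[role] != instruction.roles[role]
        )
        if mismatched:  # identical roles would not be a mistake sample
            samples.append({
                "video": attempt.clip_id,
                "instruction": dict(instruction.roles),
                "mistaken_roles": mismatched,
            })
    return samples


# Toy source clips, invented for illustration.
clips = [
    Clip("a", {"V": "pick up", "Obj": "hammer"}),
    Clip("b", {"V": "pick up", "Obj": "bolt"}),
    Clip("c", {"V": "put down", "Obj": "bolt"}),
]
samples = cross_match(clips)
```

Because each generated sample reuses an existing video, any annotations attached to that video in the source corpus can be carried over unchanged, which is one way to read the caption's remark that the datasets are annotated "for free".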