Procedural Mistake Detection via Action Effect Modeling

Wenliang Guo, Yujiang Pu, Yu Kong

TL;DR

This paper advances procedural mistake detection by jointly modeling action execution and its outcomes. The authors introduce Action Effect Modeling (AEM), which selects informative effect frames and distills object-state and spatial-relation cues from visual grounding and symbolic scene graphs into a unified effect-aware representation. A prompt-based, one-class detector uses these representations to identify mistakes, achieving state-of-the-art results on EgoPER and CaptainCook4D. The results indicate that combining execution dynamics with outcome-aware signals makes mistake detection in egocentric tasks more reliable.

Abstract

Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
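
The effect-frame selection step described in the abstract (choosing a frame by semantic relevance and visual quality) can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (a weighted sum), the weight alpha, the use of cosine similarity as the relevance measure, and all function and parameter names are assumptions made for illustration only. The embeddings and quality scores are toy inputs; in practice they might come from a vision-language encoder and an image-quality measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_effect_frame(frame_embs, text_emb, quality, alpha=0.5):
    """Return (best_index, scores) for candidate effect frames.

    All inputs are hypothetical stand-ins:
      frame_embs: one embedding vector per candidate frame
      text_emb:   embedding of the action description
      quality:    per-frame visual-quality score in [0, 1]
      alpha:      assumed weight trading relevance against quality
    """
    if not frame_embs or len(frame_embs) != len(quality):
        raise ValueError("need one quality score per frame, and at least one frame")
    # Assumed scoring rule: weighted sum of semantic relevance and visual quality.
    scores = [
        alpha * cosine(f, text_emb) + (1 - alpha) * q
        for f, q in zip(frame_embs, quality)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: three candidate frames with 2-D embeddings.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
text = [0.0, 1.0]
quality = [0.9, 0.8, 0.2]
best, scores = select_effect_frame(frames, text, quality)
print(best, scores)
```

In the toy example the middle frame wins: the last frame is the most relevant but has poor quality, while the first is sharp but irrelevant. Setting alpha to 1.0 or 0.0 recovers selection by relevance alone or by quality alone.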

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Examples of two types of mistakes, object positional mistakes and object state mistakes, in a cooking scenario from the CaptainCook4D dataset (Peddi et al., 2023).
  • Figure 2: Framework overview.
  • Figure 3: Overview of the Action Effect Modeling (AEM) module. Effect frame sampling and multimodal knowledge extraction are used only during training, to learn the effect state and relation projectors $\Theta_s, \Theta_r$; both steps are skipped during testing.
  • Figure 4: Examples of mistakes occurring in different actions. The right bar charts show mistake probabilities predicted by models without (in blue) and with (in orange) effect modeling. Red boxes in images are only used to highlight mistake regions for clearer visualization.