Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks

Wei-Jin Huang, Yuan-Ming Li, Zhi-Wei Xia, Yu-Ming Tang, Kun-Yu Lin, Jian-Fang Hu, Wei-Shi Zheng

TL;DR

An Adaptive Multiple Normal Action Representation (AMNAR) framework that predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors.

Abstract

Error detection in procedural activities is essential for consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook the common scenario where multiple, distinct actions are valid following a given sequence of executed actions. This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting the effectiveness of AMNAR and the importance of modeling multiple valid next actions in error detection. The code is available at https://github.com/iSEE-Laboratory/AMNAR.
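The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the normal action representations for the valid next actions are already available as plain feature vectors. The function names, the cosine distance, the threshold value, and the toy vectors are all illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def detect_error(ongoing, candidates, threshold):
    """Compare the ongoing action feature against every candidate.

    candidates maps a valid next-action label to its (hypothetical)
    reconstructed normal representation. The closest candidate is the
    best match; the action is flagged as an error when even that best
    match is farther away than the threshold.
    """
    if not candidates:
        raise ValueError("at least one valid next action is required")
    best_label, best_dist = min(
        ((label, cosine_distance(ongoing, rep)) for label, rep in candidates.items()),
        key=lambda pair: pair[1],
    )
    return best_label, best_dist, best_dist > threshold


# Toy 2-D representations; real features would be high-dimensional.
candidates = {"Boil Water": [1.0, 0.0], "Prepare Filter": [0.0, 1.0]}
print(detect_error([1.0, 0.0], candidates, threshold=0.5))
print(detect_error([-1.0, 0.0], candidates, threshold=0.5))
```

Taking the minimum over all candidates is what lets several distinct actions count as normal at the same step: an ongoing action only has to be close to one of the valid representations.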

Paper Structure

This paper contains 31 sections, 20 equations, 8 figures, 11 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Illustration of error detection using multiple valid next actions at time $t$. After "Grinding Coffee Bean" at time $t-1$, valid next actions include "Boil Water" and "Prepare Filter." The best matching action is selected and compared with the ongoing action. If their distance exceeds the threshold, the action is marked as an error; otherwise, it is marked as normal.
  • Figure 2: Overview of the Adaptive Multiple Normal Action Representation (AMNAR) framework. The process begins with an Action Segmentation module identifying executed actions from video input. (a) Potential Action Prediction Block predicts valid next actions using a task graph from executed action labels. (b) Representation Reconstruction Block generates normal action representations for these valid actions, leveraging temporal visual features. (c) Representations Matching Block compares the ongoing action’s feature at time $t$ with the generated representations to detect errors, indicated by a checkmark (✓) for normal actions or a cross (✗) for errors.
  • Figure 3: Overview of the Potential Action Prediction Block. Using Dynamic Programming (DP), this module identifies all longest common subsequences (LCSs) from the executed action sequence $s_t$ via the task graph $G$. These LCSs are interconnected into a unified subgraph, forming the filtered sequence $s_t^*$. Reachable child nodes from $G$ are then extracted as valid next actions $C_t$.
  • Figure 4: Error detection when the Action Segmentation Model (ASM) misclassifies an action. AMNAR correctly identifies the action as normal, whereas the EgoPED framework produces a false positive.
  • Figure 5: The Potential Action Prediction Block (PAPB) derives the longest matching subsequence from the executed sequence using the task graph. This subsequence is then used to identify all reachable nodes, representing valid next actions. This figure is reproduced from the main text for reference.
  • ...and 3 more figures
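
The last step in the Figure 3 caption, extracting valid next actions $C_t$ from the task graph $G$, can be sketched as follows. This is a simplified illustration under stated assumptions: it skips the LCS-based filtering and takes the filtered sequence $s_t^*$ as given, it treats an action as valid once all of its parents in $G$ have been executed, and the toy graph (including the "Pour Water" node) is hypothetical. The function name `valid_next_actions` is likewise an assumption, not part of the released code.

```python
def valid_next_actions(graph, executed):
    """Return actions whose prerequisites are all in the executed set.

    graph maps each action to the list of its child actions (a DAG).
    executed is the already-filtered sequence of executed action labels.
    Assumption: an action is valid when it has not been executed yet and
    every one of its parents has.
    """
    done = set(executed)
    parents = {}
    for parent, children in graph.items():
        parents.setdefault(parent, set())
        for child in children:
            parents.setdefault(child, set()).add(parent)
    return sorted(
        node for node, preds in parents.items()
        if node not in done and preds <= done
    )


# Hypothetical task graph loosely following the coffee example in Figure 1.
task_graph = {
    "Grinding Coffee Bean": ["Boil Water", "Prepare Filter"],
    "Boil Water": ["Pour Water"],
    "Prepare Filter": ["Pour Water"],
    "Pour Water": [],
}

print(valid_next_actions(task_graph, ["Grinding Coffee Bean"]))
# ['Boil Water', 'Prepare Filter']
```

Under these assumptions, both "Boil Water" and "Prepare Filter" are returned after "Grinding Coffee Bean", matching the situation in Figure 1 where more than one next action is valid.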