Transparent and Coherent Procedural Mistake Detection

Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai

TL;DR

This paper reframes procedural mistake detection (PMD) as a transparent task in which a vision-and-language model (VLM) must produce visual rationales through an iterative self-dialog. It introduces Ego4D-PMD, a frame-based PMD benchmark, and two NLI-based coherence metrics (relevance and informativeness) that quantify the quality of generated rationales. By applying coherence-based re-ranking, in-context learning, and DPO/QLoRA fine-tuning, the study improves rationale coherence, PMD accuracy, and decision confidence, with trade-offs in latency and complexity. The work also visualizes PMD outcomes to diagnose reasoning failures such as object hallucination, and discusses practical considerations and limitations for real-world task guidance systems.

Abstract

Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.

Paper Structure

This paper contains 73 sections, 8 equations, 7 figures, 17 tables.

Figures (7)

  • Figure 1: To reason through the complex task of procedural mistake detection (PMD), vision-and-language models (VLMs) are conditioned to gather visual evidence through an iterative self-dialog to rationalize their final decision.
  • Figure 2: Selected examples from Ego4D [Ego4D2022CVPR] for Procedural Mistake Detection (Ego4D-PMD). For each matching pair of a video frame and procedural text, we generate a success example and various mistake examples by sampling alternate video frames: incomplete execution, execution with the wrong verb (e.g., wringing a cloth instead of folding), execution with the wrong noun (e.g., folding paper instead of a cloth), and execution with both the wrong verb and noun (e.g., opening a notepad instead of folding a cloth). Images cropped for space.
  • Figure 3: Using BART [lewis-etal-2020-bart] fine-tuned on MNLI [williamsBroadCoverageChallengeCorpus2017] to judge success.
  • Figure 4: Visualization of decision error, relevance, and reference-adjusted informativeness for configurations of LLaVA on Ego4D-PMD testing examples. Each data point's color indicates its position along each axis.
  • Figure 5: Sample coherent PMD outputs from LLaVA with coherence-based ranking, representing the range of behaviors observed (as visualized in Figure \ref{fig:cubes}). Images cropped for visibility and space.
  • ...and 2 more figures
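
The example categories named in the Figure 2 caption can be illustrated with a small sketch. It labels a sampled frame against a target procedural step by comparing verb and noun. The `Frame` type, its `completed` flag, and the `label_example` function are hypothetical names introduced here for illustration; they are not taken from the paper or from any Ego4D tooling, and the paper's actual sampling procedure may differ.

```python
from dataclasses import dataclass
from enum import Enum


class ExampleType(Enum):
    """Categories listed in the Figure 2 caption."""
    SUCCESS = "success"
    INCOMPLETE = "incomplete execution"
    WRONG_VERB = "wrong verb"
    WRONG_NOUN = "wrong noun"
    WRONG_VERB_AND_NOUN = "wrong verb and noun"


@dataclass(frozen=True)
class Frame:
    """Hypothetical annotation for one sampled video frame."""
    verb: str        # action shown in the frame, e.g. "fold"
    noun: str        # object acted upon, e.g. "cloth"
    completed: bool  # assumption: whether the action has finished in this frame


def label_example(target_verb: str, target_noun: str, frame: Frame) -> ExampleType:
    """Label a frame relative to the target step (verb, noun)."""
    verb_ok = frame.verb == target_verb
    noun_ok = frame.noun == target_noun
    if verb_ok and noun_ok:
        # Same action on the same object: success only if it has finished.
        return ExampleType.SUCCESS if frame.completed else ExampleType.INCOMPLETE
    if noun_ok:
        return ExampleType.WRONG_VERB
    if verb_ok:
        return ExampleType.WRONG_NOUN
    return ExampleType.WRONG_VERB_AND_NOUN


if __name__ == "__main__":
    # Target step from the caption's running example: "fold the cloth".
    frames = [
        Frame("fold", "cloth", True),
        Frame("fold", "cloth", False),
        Frame("wring", "cloth", True),
        Frame("fold", "paper", True),
        Frame("open", "notepad", True),
    ]
    for f in frames:
        print(f.verb, f.noun, "->", label_example("fold", "cloth", f).value)
```

The frames in the usage block mirror the caption's own examples (wringing a cloth, folding paper, opening a notepad), so each mistake category is exercised once.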