Transparent and Coherent Procedural Mistake Detection
Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai
TL;DR
This paper reframes procedural mistake detection (PMD) as a transparent task in which a vision-language model (VLM) must produce visual rationales through iterative self-dialog. It introduces Ego4D-PMD, a frame-based PMD benchmark, and two coherence metrics based on natural language inference (NLI), relevance and informativeness, that quantify the quality of generated rationales. By applying coherence-based re-ranking, in-context learning, and DPO/QLoRA fine-tuning, the study demonstrates improved rationale coherence, PMD accuracy, and decision confidence, albeit with trade-offs in latency and complexity. The work also provides rich visualizations of PMD outcomes to diagnose reasoning failures such as object hallucination, and discusses practical considerations and limitations for real-world task guidance systems.
Abstract
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
