MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos

Sagnik Majumder; Anish Nethi; Ziad Al-Halah; Kristen Grauman

Abstract

We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.

Paper Structure (32 sections, 4 equations, 6 figures, 1 table)

Introduction
Related Work
Mistake detection in procedural videos.
Online video understanding.
Early recognition in videos.
Early mistake detection task
Task definition.
Approach
Mistake detector $\mathcal{D}$
Exit policy $\pi$
Inputs and encoding.
Policy network.
Model training
Mistake detector training.
Policy training.
...and 17 more sections

Figures (6)

Figure 1: Our goal is to learn a policy that, given a streaming video of a keystep in a procedural activity, decides how much of the video to observe before exiting (i.e., stopping inference), such that a mistake detector conditioned on the recently observed frames can accurately determine whether the keystep is a mistake or not while minimizing the fraction of the video observed. For example, in a video like the one shown in row 2, a well-trained model may observe a drill bit (frame 3), infer that the step will eventually result in a mistake (the glass breakage in later frames confirms this), and exit promptly, thereby using only about 12% of the video.
Figure 2: Our MistExit model for early mistake detection has two components: 1) a mistake detector $\mathcal{D}$ (top), and 2) an exit policy $\pi$ (bottom). At each timestep in a streaming keystep clip, $\mathcal{D}$ processes recent frames to predict the keystep’s mistake label and anticipate future features, improving the mistake detection quality. The policy $\pi$ takes the detector’s estimate and the latest frame, aggregates them over time, and decides when to exit. We train $\pi$ with a novel reward that encourages improving detection quality over time while promoting early and accurate exits.
Figure 3: Early mistake detection results. Higher AP, lower OR is better.
Figure 4: Left: Ablation on larger-scale CaptainCook4D (CC4D) [Peddi et al., 2024]. Right: Early mistake detection results on larger-scale CC4D for different lengths ($L$) of the anticipated feature sequence in our mistake detector (see the section "Mistake detector $\mathcal{D}$"). Higher AP and lower OR are better in both plots.
Figure 5: Our model's successful predictions on CaptainCook4D [Peddi et al., 2024] (top 2 rows) and Assembly101 [Sener et al., 2022] (bottom 2 rows). Our model correctly detects mistakes by identifying cues such as incorrect technique (for example, the knife positioned to produce an abnormally thin slice in row 1) and signs of struggle by the actor, such as repeatedly moving the cabin back and forth in row 3. It can also correctly predict that a step will end in a successful execution by leveraging cues that indicate the correctness of the remaining portion of the step. For instance, a closed pepper container in row 2 may suggest that the actor will not add extra pepper to the eggs, while a correctly installed wheel in row 4 may indicate that the remaining wheels will also be installed correctly.
...and 1 more figure
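
The streaming procedure described in the abstract and in the Figure 2 caption (a detector scores recent frames at each timestep, and a policy decides when to exit) can be illustrated with a minimal sketch. This is not the authors' implementation: `run_early_exit`, `toy_detector`, and `confident_exit` are hypothetical names, frames are reduced to scalar scores, and a fixed confidence-margin rule stands in for the learned reinforcement learning policy, which in the paper also aggregates detector outputs and visual observations over time.

```python
from typing import Callable, Sequence, Tuple


def run_early_exit(
    frames: Sequence[float],
    detector: Callable[[Sequence[float]], float],
    should_exit: Callable[[float, int], bool],
    window: int = 4,
) -> Tuple[bool, float]:
    """Stream frames one at a time; return (is_mistake, fraction_observed)."""
    if not frames:
        raise ValueError("frames must be non-empty")
    prob = 0.5
    for t in range(1, len(frames) + 1):
        # The detector only sees the most recent `window` observed frames.
        recent = frames[max(0, t - window):t]
        prob = detector(recent)
        # The exit rule decides whether to stop processing incoming frames.
        if should_exit(prob, t):
            break
    # If no exit fired, t equals len(frames) and the whole clip was observed.
    return prob >= 0.5, t / len(frames)


def toy_detector(recent: Sequence[float]) -> float:
    # Assumption: each "frame" is already a scalar mistake score in [0, 1].
    return sum(recent) / len(recent)


def confident_exit(prob: float, t: int, margin: float = 0.3) -> bool:
    # Assumption: a fixed confidence margin replaces the learned policy.
    return abs(prob - 0.5) >= margin


frames = [0.5, 0.6, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
is_mistake, fraction = run_early_exit(frames, toy_detector, confident_exit)
print(is_mistake, fraction)
```

In this toy run the exit rule fires at the fifth of eight frames, so the sketch reports a mistake after observing 62.5% of the clip. A learned policy would replace `confident_exit` and trade off exit time against detection accuracy through its reward.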
