Learning and Verification of Task Structure in Instructional Videos

Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell

TL;DR

This work introduces VideoTaskformer, a pre-trained video model that learns task-aware step representations by masking out steps and exploiting the global context of the entire instructional video. By predicting weakly supervised textual step labels for masked segments, the model captures both semantics and temporal structure, enabling verification of unseen videos and forecasting of future steps. The authors present three new benchmarks (mistake step detection, mistake ordering detection, and long-term step forecasting) alongside evaluations on existing tasks, showing consistent improvements over strong baselines. They also demonstrate strong performance on COIN and EPIC-KITCHENS-100, underscoring the approach's potential for scalable, context-aware instruction understanding and interactive assistance.

Abstract

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
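
To make the masked step modeling objective concrete, the following is a minimal sketch of how one training example could be built and scored. It is an illustration under stated assumptions, not the authors' implementation: clips are represented by toy feature vectors, `MASK`, `mask_one_step`, and `toy_logits` are hypothetical names, the stand-in model simply averages the visible clips as "global context", and cross-entropy is assumed as the loss over step labels.

```python
import math
import random

MASK = None  # hypothetical placeholder standing in for a masked-out step


def mask_one_step(clips, step_labels, rng):
    """Hide one randomly chosen step.

    Returns the masked copy of the clip list, the masked index,
    and the (weakly supervised) label the model should predict.
    """
    i = rng.randrange(len(clips))
    masked = list(clips)  # shallow copy so the input list is untouched
    masked[i] = MASK
    return masked, i, step_labels[i]


def toy_logits(masked_clips, weights):
    """Toy stand-in for a model (assumption): average the visible clips
    and score each step label with a fixed linear map."""
    visible = [c for c in masked_clips if c is not MASK]
    dim = len(visible[0])
    context = [sum(c[d] for c in visible) / len(visible) for d in range(dim)]
    return [sum(w[d] * context[d] for d in range(dim)) for w in weights]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


# Toy video with three steps, each clip a 2-d feature vector (made-up data).
rng = random.Random(0)
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
step_labels = [0, 2, 1]  # one weak step label per clip
weights = [[0.5, -0.5], [0.1, 0.1], [-0.5, 0.5]]  # one row per step label

masked, i, target = mask_one_step(clips, step_labels, rng)
loss = cross_entropy(toy_logits(masked, weights), target)
```

The point of the sketch is that the prediction for the hidden step depends only on the other clips in the same video, which is what distinguishes this global objective from learning each step representation from its own clip in isolation.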

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Prior work [lin2022learning, lei2021less] learns step representations from single short video clips, independent of the task, thus lacking knowledge of task structure. Our model, VideoTaskformer, learns step representations for masked video steps through the global context of all surrounding steps in the video, making our learned representations aware of task semantics and structure.
  • Figure 2: VideoTaskformer Pre-training (Left). VideoTaskformer $f_{\text{VT}}$ learns step representations for the masked out video clip $v_i$, while attending to the other clips in the video. It consists of a video encoder $f_{\text{vid}}$, a step transformer $f_{\text{trans}}$, and a linear layer $f_{\text{head}}$, and is trained using weakly supervised step labels. Downstream Tasks (Right). We evaluate step representations learned from VideoTaskformer on 6 downstream tasks.
  • Figure 3: Qualitative results. We show qualitative results of our method on 4 tasks. The step labels are not used during training and are only shown here for illustrative purposes.
  • Figure 4: Qualitative comparison. We compare results from our method VideoTF to the baseline LwDS on the short-term forecasting task. Step labels are not passed to the model as input and are only for reference.
  • Figure F1: Step classification. We qualitatively compare results from our method (VideoTaskformer) to the baseline LwDS on the step classification task. While the inputs are video clips, we only show a keyframe from the clip for visualization purposes. Correct predictions (VideoTaskformer) are shown in green and incorrect predictions (LwDS) are in red. We also show a frame from the clip corresponding to the incorrect prediction made by LwDS.
  • ...and 3 more figures
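
The pre-training pipeline named in the Figure 2 caption can be written out as a composition of its three components. This is a hedged formalization rather than the paper's own equations: the symbols $K$ (number of clips), $\tilde{v}_j$, $h_j$, $z_j$, $s_i$ (the weak step label of the masked clip), and $\mathcal{L}$ are introduced here for illustration, the masked clip is assumed to be swapped for a placeholder before encoding, and cross-entropy is assumed as the loss.

```latex
\begin{align}
\tilde{v}_j &=
  \begin{cases}
    [\text{MASK}] & \text{if } j = i,\\
    v_j & \text{otherwise,}
  \end{cases}
  \qquad j = 1, \dots, K, \\
h_j &= f_{\text{vid}}(\tilde{v}_j), \\
(z_1, \dots, z_K) &= f_{\text{trans}}(h_1, \dots, h_K), \\
\hat{y}_i &= \operatorname{softmax}\bigl(f_{\text{head}}(z_i)\bigr), \\
\mathcal{L} &= -\log \hat{y}_i[s_i].
\end{align}
```

Under these assumptions, $f_{\text{VT}}$ is the map from the masked clip sequence to $\hat{y}_i$, and the only path from the unmasked clips to the prediction for step $i$ runs through $f_{\text{trans}}$, which is where the surrounding task context enters.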