Leveraging Procedural Knowledge and Task Hierarchies for Efficient Instructional Video Pre-training

Karan Samel, Nitish Sontakke, Irfan Essa

TL;DR

The paper tackles efficient instructional video pre-training by leveraging explicit procedural steps and hierarchical task structures. It introduces Pivot, a Procedural-Hierarchical Integrated Video Transformer that jointly predicts clip-level steps and video-level hierarchy paths during pre-training, aided by clip selection and ordering augmentations and an analytically derived early stopping criterion. On HowTo100M-derived data aligned with wikiHow and a 3-level hierarchy, Pivot achieves superior downstream performance on COIN and CrossTask across task recognition, step recognition, and step forecasting, especially in low-resource settings. This work demonstrates that explicit structured knowledge about procedures and tasks yields data-efficient representations for instructional video recommendation.

Abstract

Instructional videos provide a convenient modality for learning new tasks (e.g., cooking a recipe or assembling furniture). A viewer will want to find a video that both reflects the overall task they are interested in and contains the relevant steps they need to carry it out. To support this, an instructional video model should be capable of inferring both the tasks and the steps that occur in an input video. Doing this efficiently and in a generalizable fashion is key when compute or the relevant video topics available for training are limited. To address these requirements, we explicitly mine task hierarchies and the procedural steps associated with instructional videos. We use this prior knowledge to pre-train our model, $\texttt{Pivot}$, for step and task prediction. During pre-training, we also provide video augmentation and early stopping strategies to optimally identify which model to use for downstream tasks. We test this pre-trained model on task recognition, step recognition, and step prediction on two downstream datasets. When pre-training data and compute are limited, we outperform previous baselines on these tasks. Therefore, leveraging prior task and step structures enables efficient training of $\texttt{Pivot}$ for instructional video recommendation.

Paper Structure

This paper contains 20 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: We leverage both task hierarchical data as well as procedural step information to pre-train our instructional video model $\texttt{Pivot}$.
  • Figure 2: Our model, $\texttt{Pivot}$, pre-trains on instructional videos to predict a) which procedural steps belong to each video clip (left) and b) where in the video hierarchy the current video belongs (center); c) video clip augmentation and training procedures are used to learn the joint clip-video representations most effectively (right).
  • Figure 3: Pre-trained models from different epochs are tested on downstream COIN task recognition (red line) and step recognition (blue line) tasks. The derivative of the clip step accuracy $p'(x)$ is also plotted (black line), where the max value represents the analytical early stopping point (dashed line). The saturation-based early stopping with no improvements over 50 epochs is also presented as a reference (dotted line).
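
The two stopping rules named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the clip step accuracy $p(x)$ is available as one value per pre-training epoch and approximates $p'(x)$ with a simple finite difference; how the paper actually derives or smooths the derivative is not stated here, so that choice, the function names, and the sample accuracy values are all hypothetical.

```python
def analytical_stop_epoch(acc):
    """Epoch at which the finite-difference slope of accuracy is largest.

    Assumption: p'(x) is approximated by acc[i + 1] - acc[i]; the slope
    between epochs i and i + 1 is attributed to epoch i + 1.
    """
    if len(acc) < 2:
        raise ValueError("need at least two epochs to estimate a slope")
    slopes = [acc[i + 1] - acc[i] for i in range(len(acc) - 1)]
    steepest = max(range(len(slopes)), key=slopes.__getitem__)
    return steepest + 1


def saturation_stop_epoch(acc, patience=50):
    """Epoch at which training halts after `patience` epochs with no new best.

    Returns the last epoch if the patience window is never exhausted.
    """
    best_value = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(acc):
        if value > best_value:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(acc) - 1


# Hypothetical per-epoch clip step accuracies (illustrative values only).
accuracy = [0.10, 0.12, 0.20, 0.45, 0.60, 0.66, 0.68, 0.69]
print("analytical stop:", analytical_stop_epoch(accuracy))
print("saturation stop:", saturation_stop_epoch(accuracy, patience=50))
```

On a curve like this, the derivative-based rule selects an epoch well before accuracy plateaus, whereas the saturation rule keeps training until no improvement has been seen for the full patience window.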