Open-Event Procedure Planning in Instructional Videos

Yilu Wu, Hanlin Wang, Jing Wang, Limin Wang

TL;DR

This work defines Open-event Procedure Planning (OEPP), a setting in which a planner must transfer learned procedural knowledge to events in instructional videos that were not seen during training. A new benchmark, OpenEvent, built from COIN and CrossTask, provides base and novel action spaces for evaluating transfer, and a simple visual-text matching framework built on VideoCLIP encoders tests this capability. Training-based baselines show clear gains over open-match strategies; PDPP* performs well on base events but shows limited transfer to novel events. Ablations indicate that combining $L_{ce}$ and $L_{mse}$ and using VideoCLIP features both improve transfer. The study highlights how hard it is to transfer procedural knowledge to novel events and points to future directions, including expanding data sources (e.g., WikiHow) and moving toward generative open-event planning for broader real-world applicability.

Abstract

Given the current visual observations, the traditional procedure planning task in instructional videos requires a model to generate goal-directed plans within a given action space. All previous methods for this task conduct training and inference under the same action space, so they can only plan for the pre-defined events in the training set. We argue that this setting is not applicable to human assistance in real life and aim to propose a more general and practical planning paradigm. Specifically, in this paper we introduce a new task named Open-event Procedure Planning (OEPP), which extends traditional procedure planning to the open-event setting. OEPP aims to verify whether a planner can transfer learned knowledge to similar events that have not been seen during training. We build a new benchmark, OpenEvent, for this task from existing datasets and divide the events involved into base and novel parts. During data collection, we ensure that procedural knowledge can transfer between base and novel events by evaluating, in multiple stages, the similarity between the step descriptions of different events. Based on the collected data, we further propose a simple and general framework designed specifically for OEPP and conduct an extensive study with various baseline methods, providing a detailed analysis of the results for this task.

Paper Structure

This paper contains 30 sections, 7 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of procedure planning and open-event procedure planning. Procedure planning trains and runs inference under the same action space, while open-event procedure planning conducts inference under both base (Replace Battery On Key To Car) and novel (Replace Battery On TV control) action spaces.
  • Figure 2: Visualization of OpenEvent. We show examples of two clusters, with every two rows coming from the same cluster. Different actions are marked with different colors.
  • Figure 3: Overview of our framework, shown for prediction horizon $T=4$. The start and end observations are fed into the video encoder and the action space into the text encoder. Several procedure planners then generate $T$ embeddings, and a similarity matrix is computed between these embeddings and the action text features. The green cells in the matrix mark the ground truth.
  • Figure 4: Visualization of successful results on OpenEvent for $T=3,4$.
  • Figure 5: Visualization of OpenEvent.
  • ...and 1 more figures
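
The matching step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `match_actions`), the use of cosine similarity, the feature dimension, and the random stand-in features are all hypothetical. In the paper the features come from VideoCLIP encoders and learned procedure planners, which are not reproduced here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def match_actions(step_embeddings, action_text_features):
    """Match T planner embeddings against A action text features.

    step_embeddings:      (T, D) array, one embedding per planned step.
    action_text_features: (A, D) array, one text feature per candidate action.
    Returns the (T, A) similarity matrix and the index of the best action per step.
    Cosine similarity is an assumption made for this sketch.
    """
    steps = l2_normalize(step_embeddings)
    actions = l2_normalize(action_text_features)
    similarity = steps @ actions.T
    return similarity, similarity.argmax(axis=1)


# Hypothetical stand-ins for encoder outputs (random features, D = 16).
rng = np.random.default_rng(0)
T, D = 4, 16
base_actions = rng.normal(size=(5, D))   # base action space: 5 actions
novel_actions = rng.normal(size=(7, D))  # novel action space: 7 actions

# Pretend the planners produced embeddings close to novel actions 2, 0, 6, 3.
target = np.array([2, 0, 6, 3])
step_embeddings = novel_actions[target] + 0.01 * rng.normal(size=(T, D))

# The same matching routine works for either action space; only the
# text features passed in change.
sim_novel, plan_novel = match_actions(step_embeddings, novel_actions)
sim_base, plan_base = match_actions(step_embeddings, base_actions)
print(sim_novel.shape, plan_novel)
```

Because the plan is read off a similarity matrix rather than a fixed-size classification head, the candidate action list can be swapped at inference time, which is what lets the same planner be evaluated on both base and novel action spaces.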