GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

Jiafeng Liang, Shixin Jiang, Zekun Wang, Haojie Pan, Zerui Chen, Zheng Chu, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin

TL;DR

The paper tackles the challenge of instructional video comprehension by introducing GUIDE, a guideline-guided dataset that provides task-level guidelines in addition to per-video step annotations. It builds a three-stage pipeline (video collection, automatic annotation with SP-Generator and GL-Generator, and manual refinement) to produce 560 tasks, 3.5K videos, 15K step segments, and 560 guidelines, enabling three evaluation tasks: Step Captioning, Guideline Summarization, and Guideline-Guided Captioning. Extensive experiments compare video foundation models, language foundation models, and humans, revealing that guidelines substantially improve caption quality and that cross-video guideline mining hinges on solid single-video understanding, with visual encoders identified as a key bottleneck. The results highlight the value of explicit guidelines for learning procedures from open-domain instructional videos and establish GUIDE as a practical benchmark for future research in instructional video comprehension and education-technology applications.

Abstract

The Internet hosts a substantial number of instructional videos, which provide tutorials for completing various tasks. Existing instructional video datasets focus only on specific steps at the video level and lack experiential guidelines at the task level, so beginners may struggle to learn new tasks for want of relevant experience. Moreover, specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions, and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate the comprehension ability of models: (1) Step Captioning: models must generate captions for specific steps from videos. (2) Guideline Summarization: models must mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models must generate captions for specific steps under the guidance of the guideline. We evaluate a wide range of foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe it can serve as a better benchmark for instructional video comprehension.

Paper Structure

This paper contains 45 sections, 3 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: The steps in previous datasets were trivial and unsystematic, making it difficult for beginners to learn. In contrast, our dataset provides structured guideline-guided steps. Such a guideline is a common pattern shared by all videos related to the same task.
  • Figure 2: Overview of the GUIDE dataset. GUIDE consists of 560 task queries, each containing an average of 6.2 task-related videos. These instructional videos are divided into specific steps with timestamps and text descriptions (yellow area). Additionally, each task contains a set of guideline steps representing a common pattern shared by all task-related videos (purple area).
  • Figure 3: Overview of Automatic Annotation. (a) Transcribing the video into textual subtitles and generating specific steps based on subtitles. (b) Clustering the task-related videos and generating a set of guideline steps for the cluster with the highest number of videos.
  • Figure 4: Task category distribution of GUIDE. Our videos span a wide variety of categories; the most frequent are 'Food', 'Cosmetic', and 'Craft'.
  • Figure 5: Comparison of foundation models and ground-truth annotation for step captioning, guideline summarization and guideline-guided captioning. Green, yellow, and red text denote 'correct', 'partially correct', and 'wrong' respectively.
  • ...and 5 more figures
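
The guideline-generation stage in Figure 3(b) — cluster task-related videos, then summarize a guideline from the cluster with the most videos — can be sketched roughly as follows. This is an illustrative toy, not the paper's GL-Generator: the sequence-overlap similarity, the clustering threshold, and the majority-vote summarization are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def step_similarity(a, b):
    """Overlap ratio between two videos' step-title sequences (0..1)."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_videos(videos, threshold=0.5):
    """Greedy single-pass clustering: assign each video to the first
    cluster whose representative is similar enough, else open a new one."""
    clusters = []
    for steps in videos:
        for cluster in clusters:
            if step_similarity(cluster[0], steps) >= threshold:
                cluster.append(steps)
                break
        else:
            clusters.append([steps])
    return clusters

def summarize_guideline(cluster):
    """Toy guideline: steps shared by a majority of the cluster's videos,
    in the order of the representative (first) video."""
    rep = cluster[0]
    counts = {s: sum(s in v for v in cluster) for s in rep}
    return [s for s in rep if counts[s] > len(cluster) / 2]

# Two apple-pie videos share a common step pattern; the camera video does not.
videos = [
    ["peel apple", "slice apple", "make dough", "bake pie"],
    ["make dough", "peel apple", "slice apple", "bake pie"],
    ["unbox camera", "charge battery"],
]
clusters = cluster_videos(videos)
largest = max(clusters, key=len)          # the apple-pie cluster
guideline = summarize_guideline(largest)  # common pattern of that cluster
```

In the real pipeline the step sequences come from ASR-derived subtitles (Figure 3(a)) and the summarization is done by a generator model; the sketch only shows the cluster-then-summarize control flow.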