VideoMix: Aggregating How-To Videos for Task-Oriented Learning

Saelyne Yang, Anh Truong, Juho Kim, Dingzeyu Li

TL;DR

VideoMix tackles the challenge of learning from multiple how-to videos by organizing task information around outcomes, approaches, steps, and methods using a Vision-Language Model pipeline. The Dynamic Approach Identification module compiles varied procedural paths across videos, while a two-page interface presents concise overviews and detailed method options with video snippets. Technical and user studies show VideoMix achieves comparable accuracy to human-annotated baselines but with substantially higher coverage of diverse approaches, and participants report improved understanding and a more tailored learning experience. The work demonstrates a practical, task-oriented alternative to conventional, single-video learning platforms with clear implications for scalable, multi-video instructional systems.

Abstract

Tutorial videos are a valuable resource for people looking to learn new tasks. People often watch multiple tutorial videos to gain an overall understanding of a task and to compare different approaches to achieving it. However, navigating through multiple videos can be time-consuming and mentally demanding because these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips so that users can quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface in which videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.

Paper Structure

This paper contains 53 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: VideoMix interface on the Overview page for the task "Build a Desk". (A) Users begin by selecting the task they want to learn. (B) VideoMix then presents video search results categorized by outcome types. (C) For each outcome type, users can choose from standard, simple, or complex approaches. (D) Based on the chosen approach, VideoMix displays the necessary requirements, such as materials, ingredients, and tools. Finally, (E) users can see a list of steps and a brief description of each step that makes up the chosen approach.
  • Figure 2: VideoMix interface on the Details page for the task "Build a Desk". (A) The interface displays the list of steps for the chosen approach. (B) For each step, users can explore different methods, such as tools or techniques, to complete the step. (C) When a method is selected, VideoMix presents video snippets relevant to that method. (D) Users can easily switch between different videos for the selected method, with the corresponding time frame playing automatically. (E) Additionally, users can view tips and notes extracted from the videos.
  • Figure 3: Illustration of our Dynamic Approach Identification (DAI) module, which captures a variety of approaches to accomplish a task. (a) The process begins by extracting step information from the first video using GPT-4o. This initial step taxonomy is then applied to the next video, where additional steps are identified, refining the taxonomy. This iterative process continues for all videos, progressively refining the step taxonomy with each comparison. (b) Once the final step taxonomy is established, it is reapplied to each video to detect relevant steps and align segments accordingly. Note that not all steps may be present in each video. (c) After extracting step information from each video using the common taxonomy, the system identifies standard, simple, and complex approaches based on the number of videos that follow each approach and the number of steps within each approach.
  • Figure 4: Participants felt they were more successful and efficient with VideoMix, and found VideoMix to be more useful when learning about the task compared to the baseline. There were no statistically significant differences in mental demand, effort, and frustration (*: p<0.05).
  • Figure 5: Participants' ratings on the usefulness of each information piece in understanding the task. Overall, they found the information provided by VideoMix—including outcome types, requirements, different approaches, step details, methods, and tips and notes—to be helpful in gaining a better understanding of the task.
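
The Dynamic Approach Identification flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation. The paper uses GPT-4o to extract and match steps; here steps are assumed to be pre-normalized strings compared by exact match. The selection rule (most frequent step set as standard, fewest steps as simple, most steps as complex) is also an assumption, since the caption only says the choice depends on the number of videos following each approach and the number of steps in it. The function names `build_taxonomy` and `identify_approaches` and the sample data are hypothetical.

```python
from collections import Counter


def build_taxonomy(videos):
    """Fold each video's steps into one shared, ordered step list.

    Stand-in for part (a): exact string matching replaces the GPT-4o
    step matching used in the paper (assumption for illustration).
    """
    taxonomy = []
    for steps in videos.values():
        pos = 0  # index where the next unseen step would be inserted
        for step in steps:
            if step in taxonomy:
                pos = taxonomy.index(step) + 1
            else:
                taxonomy.insert(pos, step)
                pos += 1
    return taxonomy


def identify_approaches(videos, taxonomy):
    """Group videos by the steps they contain and pick three approaches."""
    rank = {step: i for i, step in enumerate(taxonomy)}
    # Part (b): describe each video by the taxonomy steps it covers, in order.
    signatures = [
        tuple(sorted(set(steps), key=rank.__getitem__))
        for steps in videos.values()
    ]
    counts = Counter(signatures)
    # Part (c), ASSUMED rule: standard = followed by the most videos
    # (ties go to the shorter one), simple = fewest steps, complex = most steps.
    return {
        "standard": max(counts, key=lambda a: (counts[a], -len(a))),
        "simple": min(counts, key=len),
        "complex": max(counts, key=len),
    }


# Hypothetical step lists for four "Build a Desk" videos.
videos = {
    "v1": ["cut wood", "assemble frame", "attach top"],
    "v2": ["cut wood", "assemble frame", "attach top"],
    "v3": ["assemble frame", "attach top"],
    "v4": ["cut wood", "sand", "assemble frame", "attach top", "stain"],
}
taxonomy = build_taxonomy(videos)
approaches = identify_approaches(videos, taxonomy)
```

With this sample data, the three-step sequence shared by `v1` and `v2` is selected as the standard approach, the two-step sequence from `v3` as the simple one, and the five-step sequence from `v4` as the complex one. A real system would need fuzzy or model-based step matching, because different videos rarely describe the same step in identical words.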