VideoMix: Aggregating How-To Videos for Task-Oriented Learning
Saelyne Yang, Anh Truong, Juho Kim, Dingzeyu Li
TL;DR
VideoMix tackles the challenge of learning from multiple how-to videos by organizing task information around outcomes, approaches, steps, and methods using a Vision-Language Model pipeline. The Dynamic Approach Identification module compiles varied procedural paths across videos, while a two-page interface presents concise overviews and detailed method options with video snippets. Technical and user studies show that VideoMix achieves accuracy comparable to human-annotated baselines but with substantially higher coverage of diverse approaches, and participants report improved understanding and a more tailored learning experience. The work demonstrates a practical, task-oriented alternative to conventional single-video learning platforms, with clear implications for scalable, multi-video instructional systems.
Abstract
Tutorial videos are a valuable resource for people looking to learn new tasks. People often watch multiple tutorial videos on the same task to build an overall understanding by comparing different approaches to achieving it. However, navigating multiple videos can be time-consuming and mentally demanding because the videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips so that users can quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface in which videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.
