TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks

Yuexi Chen, Vlad I. Morariu, Anh Truong, Zhicheng Liu

TL;DR

TutoAI addresses the challenge of converting linear instructional videos into browsable mixed-media tutorials. It proposes a cross-domain, AI-assisted framework with three hierarchical levels: components (steps, objects, dependencies), models (selection, assembly, and evaluation of multi-modal extractors), and UI design. The work identifies and codifies the core tutorial components, develops candidate multi-model pipelines for extracting these components from instructional videos, and demonstrates a final integrated pipeline that uses LLMs for step text and timestamps and open-vocabulary detectors for objects, all presented via a creator-focused UI. In a model-level evaluation on cooking videos and two user studies (general viewers and YouTube creators), TutoAI achieves component quality higher than or comparable to a YouTube baseline and shows potential for integration into creators’ workflows. The work also offers practical guidelines for cross-domain model selection, multi-modal pipeline assembly, and user-centric UI design, intended to help the approach generalize across domains.
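
The component level described above (steps, objects, dependencies) can be pictured as a small data model. The sketch below is only an illustration under stated assumptions: the class names, field names, and the reading of "dependencies" as links between steps are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TutorialObject:
    # Hypothetical schema: an ingredient, tool, or material used in a step.
    name: str


@dataclass
class Step:
    index: int
    text: str         # step description (e.g. produced by an LLM)
    start_sec: float  # where the step starts in the source video
    end_sec: float    # where the step ends in the source video
    objects: list[TutorialObject] = field(default_factory=list)
    # Assumption: dependencies are modeled as indices of prerequisite steps.
    depends_on: list[int] = field(default_factory=list)


def ordered_steps(steps: list[Step]) -> list[Step]:
    """Return steps in an order that respects depends_on (depth-first topological sort)."""
    by_index = {s.index: s for s in steps}
    done: set[int] = set()
    result: list[Step] = []

    def visit(i: int, stack: set[int]) -> None:
        if i in done:
            return
        if i in stack:
            raise ValueError("cyclic dependency between steps")
        stack.add(i)
        for dep in by_index[i].depends_on:
            visit(dep, stack)
        stack.discard(i)
        done.add(i)
        result.append(by_index[i])

    # Visit in index order so independent steps keep their video order.
    for s in sorted(steps, key=lambda s: s.index):
        visit(s.index, set())
    return result


steps = [
    Step(0, "Boil water", 0.0, 30.0, [TutorialObject("pot")]),
    Step(1, "Chop onions", 30.0, 75.0, [TutorialObject("onion")]),
    Step(2, "Add onions to the water", 75.0, 90.0, depends_on=[0, 1]),
]
order = [s.index for s in ordered_steps(steps)]
```

A structure like this is what a creator-facing UI could browse and edit: each step carries its own text, time range, and objects, so a viewer can jump to a step instead of scrubbing through the timeline.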

Abstract

Mixed-media tutorials, which integrate videos, images, text, and diagrams to teach procedural skills, offer more browsable alternatives to timeline-based videos. However, manually creating such tutorials is tedious, and existing automated solutions are often restricted to a particular domain. While AI models hold promise, it is unclear how to harness them effectively, given the multi-modal data involved and the vast landscape of models. We present TutoAI, a cross-domain framework for AI-assisted mixed-media tutorial creation on physical tasks. First, we distill common tutorial components by surveying existing work; then, we present an approach to identify, assemble, and evaluate AI models for component extraction; finally, we propose guidelines for designing user interfaces (UI) that support tutorial creation based on AI-generated components. In preliminary user studies, TutoAI achieved quality higher than or similar to that of a baseline model.

Paper Structure

This paper contains 49 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Examples of steps in mixed-media tutorials (images used with permission).
  • Figure 2: Examples of objects in mixed-media tutorials (images used with permission).
  • Figure 3: Dependency examples in mixed-media tutorials (images used with permission).
  • Figure 4: Four candidate pipelines for step extraction. Models are in green, and generated subcomponents are in blue. After evaluation, the chosen one is No.2.
  • Figure 5: Three candidate pipelines for object extraction. Models are in green, and generated subcomponents are in blue. After evaluation, the chosen one is No.2.
  • ...and 4 more figures