Step Differences in Instructional Video

Tushar Nagarajan, Lorenzo Torresani

TL;DR

StepDiff introduces a video-conditioned language model capable of cross-video reasoning by automatically generating paired-video QA data from HowTo100M and fine-tuning a VCLM to compare two videos. It formalizes three tasks—DiffCap, DiffMCQ, and DiffRank—that evaluate describing, recognizing, and ranking differences between a reference and a candidate video. The approach achieves state-of-the-art performance on these tasks and provides a manually curated benchmark with 6292 video pairs (≈36k captions) to advance cross-video instructional understanding. The work enables fine-grained AR/VR personalized guidance, improved video retrieval, and multi-video reasoning for procedural activities, with potential extensions to retrieval and complex QA beyond atomic differences.

Abstract

Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff

Paper Structure (39 sections, 1 equation, 14 figures, 6 tables)

This paper contains 39 sections, 1 equation, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Main idea. Top: We train models to compare two videos showing the same high-level keystep and to describe their differences (e.g., in tools, ingredients, technique). Bottom: Once trained, such models can then help answer questions about a user's activity compared to a reference (e.g., an internet how-to video) like "did I do this step right?" or "am I done yet?".
  • Figure 2: Step differences framework. We first generate a comprehensive step description including information from action captions, object detections, and ASR narrations (left panel). We then select pairs of clips with similar step descriptions, and automatically generate questions and answers that compare the two (center panel, Sec. \ref{sec:dataset_gen}). Finally, we instruction-tune an LLM to generate answers conditioned on the generated questions and encoded representations of both videos (right panel, Sec. \ref{sec:instruct_tuning}). Once trained, the model directly operates on video clips to compare them, without the need for captions, ASR, or object detections.
  • Figure 3: Evaluation tasks. We evaluate on describing (DiffCap), recognizing (DiffMCQ) and ranking (DiffRank) differences.
  • Figure 4: StepDiff dataset samples. We annotate text describing differences in various categories and scores for how different the videos are in each category (1 = very different; 5 = nearly identical). More examples are in Supp.
  • Figure 5: Extended QA on video pairs. Our model, which can describe differences (row 1), can be prompted (i.e., queried without any form of retraining) for comparative reasoning (e.g., "why are they different?", "how different are they?", rows 2-3), or to bootstrap mistake detection (row 4). A failure case caused by model hallucination is shown in row 5.
  • ...and 9 more figures
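
The Figure 2 caption mentions selecting pairs of clips with similar step descriptions. The sketch below illustrates one simple way such a pairing step could work. It is not the authors' implementation: the token-overlap (Jaccard) similarity, the `threshold` value, the function names, and the clip IDs are all assumptions made for illustration, and the paper may use a different similarity measure entirely.

```python
from itertools import combinations


def tokenize(text: str) -> set[str]:
    """Lowercase a description and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_pairs(
    descriptions: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return clip pairs whose descriptions overlap at or above `threshold`.

    `descriptions` maps a hypothetical clip ID to its step description.
    The result is sorted from most to least similar.
    """
    tokens = {clip_id: tokenize(text) for clip_id, text in descriptions.items()}
    pairs = []
    # Compare every unordered pair of clips exactly once.
    for first, second in combinations(sorted(tokens), 2):
        score = jaccard(tokens[first], tokens[second])
        if score >= threshold:
            pairs.append((first, second, score))
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs


if __name__ == "__main__":
    # Made-up descriptions, not taken from the StepDiff dataset.
    example = {
        "clip_a": "whisk eggs in a bowl",
        "clip_b": "whisk eggs in a pan",
        "clip_c": "tighten the bike chain",
    }
    for first, second, score in select_pairs(example):
        print(first, second, round(score, 3))
```

In this toy example only `clip_a` and `clip_b` are paired, since they describe the same keystep with a small difference (bowl versus pan), which is the kind of pair a difference-describing model would then be asked to compare.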