Step Differences in Instructional Video
Tushar Nagarajan, Lorenzo Torresani
TL;DR
StepDiff introduces a video-conditioned language model (VCLM) capable of cross-video reasoning. It automatically generates paired-video QA data from HowTo100M and fine-tunes the VCLM to compare two videos. It formalizes three tasks (DiffCap, DiffMCQ, and DiffRank) that evaluate describing, recognizing, and ranking differences between a reference and a candidate video. The approach achieves state-of-the-art performance on these tasks and provides a manually curated benchmark with 6292 video pairs (≈36k captions) to advance cross-video instructional understanding. The work supports fine-grained, personalized AR/VR guidance and multi-video reasoning for procedural activities, with potential extensions to video retrieval and complex QA beyond atomic differences.
Abstract
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff
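
To make the data-generation idea concrete, the following is a minimal sketch of how clips from different videos that share a step annotation might be paired, with their narrations assembled into a text prompt for producing difference QA. It is an illustration under stated assumptions, not the authors' pipeline: the `Clip` fields, `pair_clips_by_step`, `build_prompt`, and the prompt wording are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Clip:
    """One annotated clip. All field names are hypothetical."""
    video_id: str   # source video identifier
    step: str       # step label taken from existing step annotations
    narration: str  # narration text accompanying the clip


def pair_clips_by_step(clips):
    """Return (reference, candidate) pairs of clips that share a step
    label but come from different videos."""
    by_step = defaultdict(list)
    for clip in clips:
        by_step[clip.step].append(clip)

    pairs = []
    for step_clips in by_step.values():
        for ref, cand in combinations(step_clips, 2):
            if ref.video_id != cand.video_id:
                pairs.append((ref, cand))
    return pairs


def build_prompt(ref, cand):
    """Format the text context for one pair. The wording is an assumption;
    it only shows where step labels and narrations could enter."""
    return (
        f"Step: {ref.step}\n"
        f"Reference narration: {ref.narration}\n"
        f"Candidate narration: {cand.narration}\n"
        "Question: How does the candidate differ from the reference?"
    )
```

A usage example under the same assumptions: given two clips labeled "chop onions" from different videos, `pair_clips_by_step` yields one pair, and `build_prompt` produces the text that a language model could turn into a question-answer sample about the pair.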
