Stitch-a-Demo: Video Demonstrations from Multistep Descriptions
Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman
TL;DR
Stitch-a-Demo tackles the challenge of visualizing multistep instructional descriptions by retrieving and stitching video clips from a large corpus to form a coherent demonstration for a given sequence of steps. It introduces a procedure evaluator transformer, a weakly supervised data augmentation strategy with hard negatives to enforce step correctness and visual coherence, and an adaptive set-cover search to efficiently assemble clips from multiple videos. The method achieves state-of-the-art results across cooking, gardening, and woodworking domains, with substantial gains in recall and favorable human judgments compared to retrieval and generation baselines. This work advances instructional video synthesis and holds practical promise for education and robotics by enabling flexible, cross-video demonstrations that reflect complex procedural descriptions.
Abstract
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual content. However, prior work does not permit visual illustration of multistep descriptions, e.g., a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions while remaining visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains of up to 29% as well as dramatic wins in a human preference study.
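To make the clip-assembly idea more concrete, the following is a minimal sketch of the classic greedy set-cover heuristic applied to step coverage. It is not the paper's adaptive set-cover search, whose details are not given here. The function name `greedy_set_cover`, the `coverage` mapping, and the video identifiers are all hypothetical, and the sketch assumes a separate retrieval stage has already decided which steps each candidate video can illustrate.

```python
from typing import Dict, Set


def greedy_set_cover(num_steps: int, coverage: Dict[str, Set[int]]) -> Dict[int, str]:
    """Assign every step index to a source video with the greedy set-cover heuristic.

    coverage maps a (hypothetical) video id to the set of step indices for which
    that video is assumed to contain a matching clip.
    """
    uncovered = set(range(num_steps))
    assignment: Dict[int, str] = {}
    while uncovered:
        # Choose the video that covers the most still-uncovered steps.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(coverage),
            key=lambda video: len(coverage[video] & uncovered),
            default=None,
        )
        gained = coverage[best] & uncovered if best is not None else set()
        if not gained:
            raise ValueError(f"no video covers steps {sorted(uncovered)}")
        for step in gained:
            assignment[step] = best
        uncovered -= gained
    return assignment


# Toy example: a five-step procedure and three candidate source videos.
coverage = {
    "video_a": {0, 1, 2},
    "video_b": {2, 3},
    "video_c": {1, 3, 4},
}
assignment = greedy_set_cover(5, coverage)
```

In this toy example, steps 0 to 2 are drawn from `video_a` and steps 3 and 4 from `video_c`, so the demonstration uses two sources rather than three. Preferring fewer source videos is one plausible proxy for visual coherence; a learned evaluator, such as the one the paper describes, would be needed to judge coherence and step correctness directly.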
