Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

TL;DR

Stitch-a-Demo tackles the challenge of visualizing multistep instructional descriptions by retrieving and stitching video clips from a large corpus to form a coherent demonstration for a given sequence of steps. It introduces a procedure evaluator transformer, a weakly supervised data augmentation strategy with hard negatives to enforce step correctness and visual coherence, and an adaptive set-cover search to efficiently assemble clips from multiple videos. The method achieves state-of-the-art results across cooking, gardening, and woodworking domains, with substantial gains in recall and favorable human judgments compared to retrieval and generation baselines. This work advances instructional video synthesis and holds practical promise for education and robotics by enabling flexible, cross-video demonstrations that reflect complex procedural descriptions.

Abstract

When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context - a caption, or an action description - and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Video demonstration from multistep descriptions. Given multistep descriptions (left) aiming to achieve a procedural task, e.g. making a vegan taco, our method obtains clips from thousands of instructional videos to visually demonstrate the procedure (right). The goal is for every clip to correctly depict a step, while maintaining visual consistency. Our proposed method goes beyond current retrieval and generation methods, which cannot handle multistep descriptions.
  • Figure 2: Overview of the method. The videos and the step descriptions in $\mathcal{C}$ are used to create a procedure mapping $\mathcal{M}$, using step localization $F_T$. The procedure query $R$ and $\mathcal{M}$ give video candidates $V'_R$. The procedure evaluator $F_R$ outputs the likelihood of each candidate.
  • Figure 3: Examples of hard negatives and procedure combination. We design negative samples that violate step correctness, visual continuity, and object state continuity (left). We show an example of combining step descriptions from $n$ (here $n=2$) video demonstrations into a novel procedure, using an LLM llama-adapter (right). The novel procedure mixes steps from both descriptions.
  • Figure 4: Qualitative results. Our method correctly visualizes the step descriptions (top), compared to prior work. The second through fourth rows show representative outputs in cooking, woodworking, and gardening. Our method correctly shows video clips from two video sources; neither video source alone can correctly demonstrate all the step descriptions. The last row contains some failure cases, showing the difficulty of the task. Here each keyframe represents a clip $v$; see Supp. video for actual videos.
  • Figure 5: Search space reduction. Using the effective set cover algorithm, the ground truth (GT) is captured in the candidate set with high probability, even with small sample set sizes. See text.
  • ...and 3 more figures
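
The captions for Figures 2 and 5 describe narrowing the search to a small candidate set by way of a set cover over the procedure mapping $\mathcal{M}$. The exact algorithm is not given in this summary, so the following is only a minimal sketch of the classic greedy set-cover heuristic, offered as a stand-in rather than as the authors' method. It assumes $\mathcal{M}$ can be represented as a dictionary from a video identifier to the set of step indices that video demonstrates; the name `greedy_video_cover` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def greedy_video_cover(procedure_map: Dict[str, Set[int]], num_steps: int) -> List[str]:
    """Greedily pick videos until every step index in range(num_steps) is covered.

    procedure_map is an assumed stand-in for the procedure mapping M:
    video id -> set of step indices that the video demonstrates.
    """
    uncovered = set(range(num_steps))
    chosen: List[str] = []
    while uncovered and procedure_map:
        # Pick the video covering the most steps that are still uncovered.
        best = max(procedure_map, key=lambda v: len(procedure_map[v] & uncovered))
        gain = procedure_map[best] & uncovered
        if not gain:
            break  # the remaining steps appear in no video
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy mapping (hypothetical data): four videos, a five-step query.
toy_map = {
    "video_a": {0, 1, 2},
    "video_b": {2, 3},
    "video_c": {3, 4},
    "video_d": {1},
}
cover = greedy_video_cover(toy_map, num_steps=5)
print(cover)
```

In this toy case two videos suffice to cover all five steps, so a downstream scorer such as the procedure evaluator $F_R$ would only need to rank clip combinations drawn from those sources instead of from the full corpus.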