Efficient Pre-training for Localized Instruction Generation of Videos

Anil Batra; Davide Moltisanti; Laura Sevilla-Lara; Marcus Rohrbach; Frank Keller

TL;DR

Procedural videos pose two challenges: precisely localizing steps and generating coherent instructions, which typically requires large, noisy pre-training data. The authors introduce Sieve & Swap to automatically curate a smaller, higher-quality pre-training dataset by sieving ASR transcripts and swapping in human-written recipe steps, reducing data size by orders of magnitude. They further propose ProcX, a set-based transformer with key-aware deformable attention, a contrastive learning setup, and an IoU-aware confidence mechanism to jointly localize steps and generate instructions. Trained on the curated data, ProcX achieves state-of-the-art results on YouCook2 and Tasty with far less pre-training data, demonstrating strong data efficiency and potential for domain transfer in instructional video understanding.

Abstract

Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance but demands significant computational resources. Furthermore, transcripts contain irrelevant content and differ in style from human-written instructions. To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters out irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instructions from a text-only recipe dataset. The resulting dataset is three orders of magnitude smaller than current web-scale datasets but enables efficient training of large-scale models. Alongside Sieve-&-Swap, we propose Procedure Transformer (ProcX), a model for end-to-end step localization and instruction generation for procedural videos. When pre-trained on our curated dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty while using a fraction of the training data. We have released the code and dataset.

Paper Structure (35 sections, 11 equations, 17 figures, 5 tables)

This paper contains 35 sections, 11 equations, 17 figures, 5 tables.

Introduction
Related Work
The Task: Localized Instruction Generation
Sieve & Swap
Objective
Dataset Creation
Source Datasets.
Sieving Videos.
Sieve & Swap ASR Segments.
Procedure Transformer (ProcX)
Preliminary: Set-based Localization and Captioning.
Key-aware Deformable Attention.
Contrastive Transformer.
IoU-aware Confidence Score (gVFL).
Experimental Setup
...and 20 more sections

Figures (17)

Figure 1: (a) Sieve & Swap at a glance. We first remove irrelevant video ASR (Automatic Speech Recognition) segments, then substitute the raw transcripts with recipe steps retrieved from a recipe database. The resulting dataset is used for pre-training. Previous work uses raw and noisier transcripts, thus requiring larger amounts of data for effective pre-training. (b) With Sieve & Swap we generate a smaller but better pre-training dataset. Compared to using a larger number of raw segments (right), models achieve higher performance with a fraction of the data (middle). Note how the improvement from no pre-training (left) is steeper with Sieve & Swap. The plotted metric is the generated instruction coherency metric (SODA-C [fujita2020soda]); models were fine-tuned and tested on YouCook2 [ZhXuCoCVPR18].
Figure 2: Sieve & Swap: A technique to sieve ASR transcripts and swap them with human-written instructions. Top: we first sieve videos and recipes by matching their titles and text content. Bottom: for each ASR transcript in each filtered video, we obtain a text embedding and look for the nearest neighbor among the text embeddings of each recipe step from the filtered recipes. If the retrieved neighbor is too far away (gray dots) we consider the transcript segment irrelevant and discard it. Otherwise (colored dots) we replace it with the recipe step corresponding to the retrieved neighbor.
Figure 3: Attention visualization of one head of the visual encoder. Attention is computed between multi-scale video features and sampled points across scales. In vanilla deformable attention (a), the attention collapses to focus on a single sampled point and fails to attend to the entire video features. With our modified key-aware deformable attention (b), the attention weights spread across the video and help to utilize the entire video context to learn better multi-scale video features.
Figure 4: (a) Kernel Density Estimation for Content words in raw ASR transcripts and our Sieve & Swap dataset. (b) Frequency histogram for a sample of words.
Figure 5: (a) Top 50 recipes from the Sieve & Swap dataset containing 4K unique recipes. (b) Distribution of the number of words per instruction sentence.
...and 12 more figures
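
The segment-level step described in the Figure 2 caption (nearest-neighbor retrieval, then discard or replace) can be illustrated with a minimal sketch. This is not the authors' released code: the cosine distance, the `max_distance` threshold, the tuple layout, and the toy two-dimensional embeddings are all assumptions made for illustration, and the video-level title matching is omitted.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sieve_and_swap(asr_segments, recipe_steps, max_distance):
    """Hypothetical helper, not taken from the paper.

    asr_segments: list of (start_sec, end_sec, embedding)
    recipe_steps: list of (step_text, embedding)
    Returns (start_sec, end_sec, step_text) for every retained segment.
    """
    curated = []
    for start, end, embedding in asr_segments:
        # Nearest recipe step in embedding space.
        step_text, distance = min(
            ((text, cosine_distance(embedding, step_emb))
             for text, step_emb in recipe_steps),
            key=lambda pair: pair[1],
        )
        # Sieve: drop the segment when even its nearest step is too far away.
        if distance <= max_distance:
            # Swap: keep the timestamps, replace the transcript with the step.
            curated.append((start, end, step_text))
    return curated


# Toy data; real embeddings would come from a sentence encoder.
recipe_steps = [
    ("Whisk the eggs with salt.", [1.0, 0.0]),
    ("Fry the batter until golden.", [0.0, 1.0]),
]
asr_segments = [
    (0.0, 5.0, [0.9, 0.1]),     # close to the first step
    (5.0, 9.0, [-1.0, -1.0]),   # far from every step (e.g. off-topic chatter)
    (9.0, 15.0, [0.2, 0.8]),    # close to the second step
]
curated = sieve_and_swap(asr_segments, recipe_steps, max_distance=0.2)
```

In this toy run the middle segment is discarded and the other two keep their timestamps while taking on the text of their nearest recipe step. In practice the threshold would need tuning, since it trades dataset size against how closely the swapped text matches the video.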