Multi-Sentence Grounding for Long-term Instructional Video

Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR

The paper addresses temporal localization and descriptive step annotation in long instructional videos whose ASR transcripts are noisy. It introduces HowToStep, a dataset built by re-transcribing speech with WhisperX and prompting an LLM to summarize the transcripts into descriptive procedural steps, and NaSVA, a Transformer-based multi-sentence grounding model that aligns steps to video segments, with step timestamps refined in two stages. Trained with HowToStep, NaSVA achieves state-of-the-art results on HT-Step, HTM-Align, and CrossTask, improving on previous methods by 9.0%, 5.1%, and 1.9% respectively. The work also provides ablations and publicly releases code, models, and data.

Abstract

In this paper, we aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset and to construct a high-quality video-text dataset with supervision from multiple descriptive steps, named HowToStep. We make the following contributions: (i) improving the quality of sentences in the dataset by upgrading the ASR system to reduce speech recognition errors and prompting a large language model to transform noisy ASR transcripts into descriptive steps; (ii) proposing a Transformer-based architecture with all texts as queries, iteratively attending to the visual features, to temporally align the generated steps to the corresponding video segments. To measure the quality of our curated dataset, we train models on it for the task of multi-sentence grounding, i.e., given a long-form video and multiple associated sentences, determining their corresponding timestamps in the video simultaneously. As a result, the model shows superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely, 9.0% on HT-Step, 5.1% on HTM-Align and 1.9% on CrossTask. All code, models, and the resulting dataset have been publicly released.
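
As a rough illustration of the multi-sentence grounding task described above, the sketch below turns a sentence-by-timestep score matrix into one temporal window per sentence. It is not the authors' decoding procedure: the function name `decode_windows`, the `threshold` value, and the `seconds_per_step` parameter are all assumptions made for this example.

```python
from typing import List, Optional, Tuple


def decode_windows(
    scores: List[List[float]],
    threshold: float = 0.5,         # hypothetical cut-off, not from the paper
    seconds_per_step: float = 1.0,  # hypothetical temporal resolution
) -> List[Optional[Tuple[float, float]]]:
    """Return one (start, end) window in seconds per sentence.

    scores[k][t] is the alignment score of sentence k at timestep t.
    A sentence whose best score is below `threshold` is treated as
    not groundable in the video and mapped to None.
    """
    windows: List[Optional[Tuple[float, float]]] = []
    for row in scores:
        if not row:
            windows.append(None)
            continue
        # Timestep with the highest score for this sentence.
        peak = max(range(len(row)), key=row.__getitem__)
        if row[peak] < threshold:
            windows.append(None)
            continue
        # Grow the window around the peak while scores stay above threshold.
        start = peak
        while start > 0 and row[start - 1] >= threshold:
            start -= 1
        end = peak
        while end < len(row) - 1 and row[end + 1] >= threshold:
            end += 1
        windows.append((start * seconds_per_step, (end + 1) * seconds_per_step))
    return windows


if __name__ == "__main__":
    example = [
        [0.1, 0.7, 0.9, 0.6, 0.2],  # sentence that is visible in the video
        [0.1, 0.2, 0.1, 0.3, 0.2],  # sentence with no confident match
    ]
    print(decode_windows(example))
```

All sentences are decoded from the same matrix in one pass, which mirrors the "simultaneously" in the task definition; sentences without a confident match are left ungrounded rather than forced onto a segment.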

Paper Structure (23 sections, 11 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: A comparison of the proposed HowToStep with annotations on HowTo100M. Our dataset consists of multiple descriptive steps, with the corresponding temporal windows. Compared to existing training data derived from ASR transcripts [han2022temporal] and task-related articles from wikiHow [afouras2024htm, mavroudi2023learning, chen2022weakly], HowToStep data offers the following advantages: 1) Descriptive: clearly describes the procedural action steps in the instructional video; 2) Concise: all sentences can be grounded in the video, without redundancy or noise; 3) Temporally well-aligned: offers precise temporal boundaries for procedural steps.
  • Figure 2: Schematic illustration of the proposed pipeline for summarizing noisy ASR transcripts into descriptive steps (left) and determining their start-end timestamps in the video (right). We use a Large Language Model (LLM) to summarize the narrations from ASR transcripts into descriptive steps. Afterwards, we obtain coarse pseudo-labels by chaining the 'steps$\rightarrow$ASR' similarity and the 'ASR$\rightarrow$video' timestamps, and use them to train our multi-sentence grounding network NaSVA in Stage 1. Lastly, we use the trained model to refine the timestamps of the generated steps in Stage 2, resulting in an extra training source for multi-sentence grounding, named HowToStep.
  • Figure 3: Schematic visualization of the proposed multi-sentence grounding network, termed NaSVA. The visual features are treated as key-value pairs and the textual features as queries, to predict the alignment score matrix $\hat{\mathbb{A}}$ between video and texts.
  • Figure 4: Qualitative examples of manually annotated visually-aligned text-to-video alignment matrix $\mathbb{Y} \in \{0,1\}^{K \times T}$ and the learned text-to-video alignment score matrix $\hat{\mathbb{A}} \in \mathbb{R}^{K \times T}$ of the model output for samples from HTM-Align (left) and HT-Step (right). The ground truth timestamps of the example on HT-Step are labelled manually. Note that the temporal density and order of texts are quite different between the two tasks.
  • Figure 4: Ablation study for visual-textual backbones and ASR systems. Here we only use the weakly-aligned ASR transcripts to train our model.
  • ...and 4 more figures
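
The Stage 1 pseudo-labelling described in the Figure 2 caption (chaining 'steps$\rightarrow$ASR' similarity with 'ASR$\rightarrow$video' timestamps) can be read, in its simplest form, as the sketch below. This is one plausible interpretation rather than the authors' implementation: the similarity values are taken as given, and the function name `chain_pseudo_labels` and the `min_similarity` threshold are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def chain_pseudo_labels(
    step_to_asr: List[List[float]],
    asr_windows: List[Window],
    min_similarity: float = 0.6,  # hypothetical filter, not from the paper
) -> List[Optional[Window]]:
    """Give each generated step the window of its most similar ASR sentence.

    step_to_asr[i][j] is the similarity between step i and ASR sentence j
    (the 'steps -> ASR' link); asr_windows[j] is the video window of ASR
    sentence j (the 'ASR -> video' link). Steps whose best similarity is
    below `min_similarity` get no pseudo-label (None).
    """
    labels: List[Optional[Window]] = []
    for sims in step_to_asr:
        if not sims:
            labels.append(None)
            continue
        # Index of the ASR sentence most similar to this step.
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(asr_windows[best] if sims[best] >= min_similarity else None)
    return labels


if __name__ == "__main__":
    similarities = [
        [0.2, 0.9, 0.1],  # step 0 clearly matches ASR sentence 1
        [0.3, 0.4, 0.2],  # step 1 has no confident match
    ]
    windows = [(0.0, 4.0), (5.0, 9.0), (12.0, 15.0)]
    print(chain_pseudo_labels(similarities, windows))
```

Labels produced this way are only as accurate as the ASR timestamps they inherit, which is consistent with the caption's description of a second stage in which the trained model refines them.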