Multi-Sentence Grounding for Long-term Instructional Video
Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie
TL;DR
The paper tackles temporal localization and descriptive step annotation in long instructional videos whose ASR transcripts are noisy. It introduces HowToStep, a dataset built via WhisperX ASR and LLM-based summarization to produce descriptive procedural steps, and NaSVA, a Transformer-based multi-sentence grounding model that aligns steps to video segments with a two-stage timestamp refinement. Using HowToStep and NaSVA, the authors achieve state-of-the-art results on the HT-Step, HTM-Align, and CrossTask benchmarks, surpassing previous methods by 9.0%, 5.1%, and 1.9%, respectively. The work also provides comprehensive ablations and a public release of code and data, enabling scalable, high-quality video-text datasets for instructional video understanding.
Abstract
In this paper, we aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset and to construct a high-quality video-text dataset with supervision from multiple descriptive steps, named HowToStep. We make the following contributions: (i) we improve the quality of sentences in the dataset by upgrading the ASR system to reduce speech recognition errors and by prompting a large language model to transform noisy ASR transcripts into descriptive steps; (ii) we propose a Transformer-based architecture that takes all texts as queries, iteratively attending to the visual features, to temporally align the generated steps to their corresponding video segments. To measure the quality of our curated dataset, we train models on it for the task of multi-sentence grounding, i.e., given a long-form video and multiple associated sentences, determining their corresponding timestamps in the video simultaneously. The resulting model shows superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks, namely, 9.0% on HT-Step, 5.1% on HTM-Align, and 1.9% on CrossTask. All code, models, and the resulting dataset have been publicly released.
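To make the input and output of multi-sentence grounding concrete, the following is a minimal sketch of the task interface only, not the authors' model. It assumes precomputed sentence and frame feature vectors and replaces the learned Transformer with a plain cosine-similarity argmax; the function names, the `threshold` parameter, and the `seconds_per_frame` parameter are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground_sentences(text_feats, video_feats, seconds_per_frame=1.0, threshold=0.5):
    """Assign every sentence a timestamp in one pass over the video.

    Returns one (timestamp_in_seconds, score) pair per sentence. The timestamp
    is None when no frame reaches `threshold`, i.e. the sentence is treated as
    not alignable to the video. This stands in for a learned alignment model.
    """
    if not video_feats:
        raise ValueError("video_feats must contain at least one frame feature")
    results = []
    for sentence in text_feats:
        scores = [cosine(sentence, frame) for frame in video_feats]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((best * seconds_per_frame, scores[best]))
        else:
            results.append((None, scores[best]))
    return results


if __name__ == "__main__":
    # Toy features: three frames and three sentences in a 2-D embedding space.
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]
    for timestamp, score in ground_sentences(text, video, seconds_per_frame=2.0):
        print(timestamp, round(score, 3))
```

In this toy setup, the third sentence points away from every frame feature, so it falls below the threshold and is returned without a timestamp, which mirrors the fact that some sentences in narrated videos have no visual counterpart.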
