HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

TL;DR

This work proposes to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale, and introduces a prompting method that is able to take into account a longer text of subtitles, allowing for human-style video captions at scale without human supervision.

Abstract

Instructional videos are a common source for learning text-video or even multimodal representations by leveraging subtitles extracted from the audio signal of the videos with automatic speech recognition (ASR) systems. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale. Specifically, we prompt an LLM to create plausible video captions based on ASR subtitles of instructional videos. To this end, we introduce a prompting method that can take into account a longer span of subtitle text, allowing us to capture contextual information beyond a single sentence. We further prompt the LLM to generate a timestamp for each produced caption based on the timestamps of the subtitles, and finally align the generated captions to the video temporally. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve performance across many benchmark datasets for zero-shot text-video retrieval and video captioning, but also disentangle textual narration from the audio, boosting performance in text-video-audio tasks.

Paper Structure

This paper contains 28 sections, 2 equations, 7 figures, and 17 tables.

Figures (7)

  • Figure 1: ASR subtitles deviate from human-written captions: they contain a lot of filler phrases, e.g., "we're going to", and extra information, e.g., "they make dog toothbrushes". We propose to generate human-style video captions based on the ASR subtitles and their timestamps that we further temporally realign with the video.
  • Figure 2: Schematic visualization of the proposed HowToCaption method. Subtitles obtained from an automatic speech recognition (ASR) system are divided into blocks that contain longer contextual information. A large pre-trained language model is then used to generate plausible video captions based on the ASR subtitles, along with timestamps for each caption. The generated captions and timestamps are then post-processed with a pre-trained text-video model, both to improve their alignment to the video and to filter out captions with low similarity to the corresponding video.
  • Figure 3: Examples of video-caption pairs from our HowToCaption dataset. ASR subtitles, which provide only noisy supervision for the video, are converted from spoken-language style into written-language-style captions. Note that some details in the generated captions are taken from a longer context; see the supplement for a full example.
  • Figure D.1: Caption length statistics of our HowToCaption dataset. We randomly sample 5000 videos to plot the distributions.
  • Figure E.1: Extended example of video-caption pairs from our HowToCaption dataset (an extension of Fig. 3 of the main paper). The ASR subtitles within the corresponding video clip are shown in bold. We note that some details in the generated captions are derived from a long ASR context.
  • ...and 2 more figures
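
The pipeline described in the Figure 2 caption (grouping subtitles into longer blocks, prompting an LLM with timestamped text, then filtering captions by text-video similarity) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Subtitle` class, the function names, the 200-word block size, the prompt wording, and the 0.5 similarity threshold are all assumptions made for the example, and the similarity scores are assumed to come from some pre-trained text-video model that is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtitle:
    """One ASR subtitle segment (hypothetical container for this sketch)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def split_into_blocks(subs: List[Subtitle], max_words: int = 200) -> List[List[Subtitle]]:
    """Greedily group consecutive subtitles into blocks of at most max_words words.

    The word budget is an assumed value; a single subtitle longer than the
    budget still forms its own block.
    """
    blocks: List[List[Subtitle]] = []
    current: List[Subtitle] = []
    count = 0
    for sub in subs:
        n_words = len(sub.text.split())
        # Close the current block if adding this subtitle would exceed the budget.
        if current and count + n_words > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sub)
        count += n_words
    if current:
        blocks.append(current)
    return blocks


def build_prompt(block: List[Subtitle]) -> str:
    """Format one block as a timestamped prompt (the wording is illustrative only)."""
    lines = [f"{int(s.start)}s: {s.text}" for s in block]
    instruction = (
        "Here are timestamped subtitles from an instructional video. "
        "Write short captions describing what is likely shown, "
        "and give a timestamp for each caption.\n"
    )
    return instruction + "\n".join(lines)


def filter_captions(captions: List[str], scores: List[float], threshold: float = 0.5) -> List[str]:
    """Keep captions whose text-video similarity score reaches the assumed threshold."""
    return [c for c, s in zip(captions, scores) if s >= threshold]
```

Grouping by a word budget is only one way to give the LLM context beyond a single sentence; a fixed time window per block would be an equally plausible choice under the same description.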