Scaling Up Video Summarization Pretraining with Large Language Models

Dawit Mureja Argaw; Seunghyun Yoon; Fabian Caba Heilbron; Hanieh Deilamsalehy; Trung Bui; Zhaowen Wang; Franck Dernoncourt; Joon Son Chung

TL;DR

This work tackles data scarcity in long-form video summarization by introducing a scalable pretraining pipeline that uses large language models (LLMs) as oracle summarizers to convert long-form videos with dense speech-to-video alignment into a large pseudo-ground-truth dataset (LfVS-P) of 250K video–summary pairs. It also provides the LfVS-T benchmark of 1,200 professionally annotated long videos for evaluation. The authors propose an autoregressive, multimodal Transformer-based model that fuses visual content and transcribed speech through cross-modal attention to generate concise summary videos, and that can operate with or without text. Experiments show strong cross-dataset generalization and state-of-the-art performance on LfVS-T as well as on traditional benchmarks such as SumMe and TVSum, highlighting the effectiveness of large-scale pretraining for video summarization.

Abstract

Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in their size, constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state-of-the-art in video summarization across several benchmarks.

Paper Structure (34 sections, 11 equations, 3 figures, 5 tables)

Introduction
Contributions.
Related Works
Text Summarization.
Video Summarization.
Scalable Dataset for Video Summarization
Source Data.
Prompting LLMs for Extractive Text Summarization.
Pseudo-Ground Truth Video Summary.
LfVS-T Benchmark.
Methodology
Problem Formulation.
Video Summarization Network
Long Video Encoding.
Long Text Encoding.
...and 19 more sections

Figures (3)

Figure 1: Scalable Dataset for Video Summarization. Given a long-form video with dense speech-to-video alignment, we first use a speech-to-text model [bain2023whisperx] to transcribe the video. Next, we preprocess the text so that each sentence in the transcript is accompanied by its corresponding start timestamp. We then prompt an LLM [openai2023gpt, touvron2023llama] to extract the most critical and informative moments from the video along with their timestamps. After extracting the textual summary, we map it back to the relevant video segments to compose a pseudo-ground truth summary. Following this pipeline, we generate a large-scale dataset of video-summary pairs for video summarization pretraining.
Figure 2: Prompt Engineering. We formulate a prompt instructing an LLM to perform an extractive text summarization task. We explicitly emphasize not paraphrasing the wording in the extracted sentences and retaining their timestamps. This ensures seamless matching of the text summary back to the input video.
Figure 3: Video Summarization Network. We use a pretrained CLIP [radford2021learning] model to represent an input video as a sequence of visual tokens. Similarly, we use a pretrained sentence encoder [liu2019roberta] to encode the long text corpus. In the absence of associated text, we utilize a special MASK token as the text input. We then use a stack of transformer encoders to contextualize the visual and textual features. Next, we incorporate multi-modal cues from the contextualized features via cross-modal attention. Finally, a summary decoder takes the multi-modal features as input and autoregressively decodes the visual representation of the segments that will compose a video summary.
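
The mapping step described in the Figure 1 caption (from extracted transcript sentences back to video segments) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Sentence` type, the `summary_segments` function, and the rule that a sentence's segment ends where the next sentence starts are assumptions made for illustration, since the captions only state that each sentence carries a start timestamp and that extracted sentences are not paraphrased.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Hypothetical transcript unit: start timestamp (seconds) plus text."""
    start: float
    text: str


def summary_segments(transcript, extracted, video_duration):
    """Map extracted sentences back to (start, end) video segments.

    Assumption (not from the paper): a sentence's segment runs from its own
    start timestamp to the start of the next transcript sentence, or to the
    end of the video for the last sentence.
    """
    ordered = sorted(transcript, key=lambda s: s.start)
    # Exact matching works because the prompt asks the LLM to keep the
    # wording and timestamps of extracted sentences unchanged.
    keep = {(s.start, s.text) for s in extracted}

    segments = []
    for i, sent in enumerate(ordered):
        if (sent.start, sent.text) not in keep:
            continue
        end = ordered[i + 1].start if i + 1 < len(ordered) else video_duration
        if segments and segments[-1][1] == sent.start:
            # Merge back-to-back selected sentences into one segment.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((sent.start, end))
    return segments


# Illustrative data only; not taken from the dataset.
transcript = [
    Sentence(0.0, "Welcome to the channel."),
    Sentence(4.0, "Today we assemble the frame."),
    Sentence(9.5, "First, attach the two side rails."),
    Sentence(15.0, "Thanks for watching."),
]
extracted = [transcript[1], transcript[2]]
segments = summary_segments(transcript, extracted, video_duration=20.0)
```

Matching on the exact `(start, text)` pair is the reason the prompt in Figure 2 forbids paraphrasing: any rewording by the LLM would make the lookup fail, and a fuzzy matcher would be needed instead.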
