Scaling Up Video Summarization Pretraining with Large Language Models
Dawit Mureja Argaw, Seunghyun Yoon, Fabian Caba Heilbron, Hanieh Deilamsalehy, Trung Bui, Zhaowen Wang, Franck Dernoncourt, Joon Son Chung
TL;DR
This work tackles data scarcity in long-form video summarization by introducing a scalable pretraining pipeline that uses large language models (LLMs) as oracle summarizers to convert long-form, densely aligned videos into a large pseudo-ground-truth dataset (LfVS-P) of 250K video–summary pairs. It also provides the LfVS-T benchmark of 1,200 professionally annotated long videos for evaluating progress in the field. The authors propose an autoregressive, multimodal Transformer-based model that fuses visual content and transcribed speech through cross-modal attention to generate concise summary videos, and that can operate with or without text. Experiments show strong cross-dataset generalization and state-of-the-art performance on LfVS-T as well as on traditional benchmarks such as SumMe and TVSum, highlighting the effectiveness of large-scale pretraining for video summarization.
Abstract
Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in size, which constrains how well state-of-the-art methods generalize. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1,200 long videos, each with high-quality summaries annotated by professionals. Extensive experiments indicate that our proposed approach sets a new state of the art in video summarization across several benchmarks.
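The dataset-generation idea in the abstract can be illustrated with a minimal sketch: a text summarizer picks the important transcript sentences, and the speech-to-video alignment maps those sentences back to frame-level pseudo-labels. Everything named here (`Sentence`, `toy_oracle`, `pseudo_labels`, the 1 frame-per-second sampling) is a hypothetical stand-in chosen for illustration; in particular, `toy_oracle` is a simple length-based heuristic used in place of an LLM, and the actual prompting and alignment procedure are not described in this section.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sentence:
    """One transcript sentence with its aligned time span in seconds."""
    text: str
    start: float
    end: float


def toy_oracle(texts: Sequence[str], k: int = 2) -> List[int]:
    """Stand-in for an LLM oracle: keep the k longest sentences.

    A real oracle would be an LLM asked to select the key sentences;
    this heuristic only makes the sketch runnable offline.
    """
    ranked = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
    return sorted(ranked[:k])


def pseudo_labels(
    sentences: Sequence[Sentence],
    oracle: Callable[[Sequence[str]], List[int]],
    duration: float,
    fps: float = 1.0,
) -> List[int]:
    """Mark each sampled frame 1 if it falls inside a selected sentence span."""
    n_frames = int(duration * fps)
    labels = [0] * n_frames
    keep = oracle([s.text for s in sentences])
    for i in keep:
        first = int(sentences[i].start * fps)
        last = min(n_frames, math.ceil(sentences[i].end * fps))
        for f in range(first, last):
            labels[f] = 1
    return labels


transcript = [
    Sentence("intro", 0.0, 2.0),
    Sentence("the key result is shown here", 2.0, 5.0),
    Sentence("filler", 5.0, 8.0),
    Sentence("final takeaway of the talk", 8.0, 10.0),
]
labels = pseudo_labels(transcript, toy_oracle, duration=10.0)
print(labels)  # [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
```

Because the oracle is passed in as a plain callable, the heuristic can be swapped for an LLM-backed selector without changing the label-mapping step.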
