Cap2Sum: Learning to Summarize Videos by Generating Captions

Cairong Zhao, Chutian Wang, Zifan Song, Guosheng Hu, Haonan Chen, Xiaofan Zhai

TL;DR

Cap2Sum addresses the high labeling cost of video summarization by training a summarizer with dense video captions as weak supervision, enabling zero-shot and few-shot summarization on target datasets. It introduces a two-component architecture (a video summarizer and a dense-video captioner) augmented by a CLIP Prior to bridge the domain gap between captioned videos and summarization videos. Two new captioned datasets, TVSum-Caption and SumMe-Caption, support evaluation of caption-based fine-tuning and generalization. Experimental results show that Cap2Sum achieves state-of-the-art performance in both zero-shot and fine-tuned settings and generalizes well across datasets.

Abstract

With the rapid growth of video data on the internet, video summarization is becoming a very important AI technology. However, due to the high labelling cost of video summarization, existing studies have to be conducted on small-scale datasets, leading to limited performance and generalization capacity. In this work, we introduce the use of dense video captions as a supervision signal to train video summarization models. To exploit dense video caption annotations, we propose Cap2Sum, a model that learns to summarize videos by generating captions. This weakly-supervised approach allows us to train the models on large-scale dense video caption datasets to achieve better performance and generalization capacity. To further improve the generalization capacity, we introduce a CLIP Prior mechanism (CLIP being a strong vision-language model) to enhance the learning of important objects in the videos that captions may ignore. In practice, Cap2Sum can perform zero-shot video summarization or be fine-tuned on the ground-truth summaries or video captions of the target dataset. To examine the performance of Cap2Sum after weakly-supervised fine-tuning on video captions, we propose two new datasets, TVSum-Caption and SumMe-Caption, which are derived from two common video summarization datasets and will be publicly released. We conduct extensive experiments, and the results demonstrate that our method achieves significant improvements in performance and generalization capacity compared with previous methods.

Paper Structure (21 sections, 7 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: (a) Overview of the proposed Cap2Sum framework. For an input video, the video frames are encoded into features by a pre-trained encoder. The features are fed to the summarizer to generate frame-wise summarization scores. These scores are used to weight the frame features, and the weighted features are fed to the captioner to generate dense captions. As an auxiliary, a CLIP Prior mechanism is proposed to improve the summarization. (b) Architecture of the summarizer in Cap2Sum. (c) Architecture of the captioner in Cap2Sum.
  • Figure 2: Pipeline of the CLIP Prior Generator. We employ CLIP to calculate the similarity between each frame and pre-defined texts. These similarities are clipped and post-processed to generate the CLIP Prior.
  • Figure 3: Statistics of the TVSum-Caption dataset and ActivityNet-Caption dataset.
  • Figure 4: Visualization results with the captions generated by Cap2Sum on TVSum. The gray histogram shows the ground-truth importance scores for each frame and we color-code the corresponding captions, video frames, and importance scores.
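
The data flow described in the captions of Figures 1 and 2 can be illustrated with a small sketch. It is not the authors' implementation: the embeddings are plain lists standing in for encoder and CLIP outputs, no CLIP model is called, and the function names `weight_features`, `cosine`, and `similarity_prior` are hypothetical. The captions say only that the similarities are "clipped and post-processed", so taking the maximum over texts and applying min-max normalization are assumptions made here for illustration.

```python
import math


def weight_features(features, scores):
    """Scale each frame feature vector by its frame-wise summarization score."""
    if len(features) != len(scores):
        raise ValueError("need exactly one score per frame")
    return [[s * x for x in frame] for frame, s in zip(features, scores)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_prior(frame_embs, text_embs, low=0.0, high=1.0):
    """Per-frame prior from frame/text similarities.

    Assumed post-processing (not specified in the source):
    1. take the best-matching text for each frame,
    2. clip the similarity to [low, high],
    3. min-max normalize over the frames of the video.
    """
    raw = [max(cosine(f, t) for t in text_embs) for f in frame_embs]
    clipped = [min(max(r, low), high) for r in raw]
    lo, hi = min(clipped), max(clipped)
    if hi == lo:
        # All frames match equally well, so the prior carries no information.
        return [0.0] * len(clipped)
    return [(c - lo) / (hi - lo) for c in clipped]


# Toy stand-ins for encoder outputs: three frames, one pre-defined text.
frame_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[1.0, 0.0]]

prior = similarity_prior(frame_embs, text_embs)
weighted = weight_features([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0])
```

In this toy input the first frame aligns exactly with the text and receives the highest prior, while the orthogonal second frame receives the lowest. In the framework itself, the weighted features would be passed to the captioner, so that captioning quality provides the training signal for the scores.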