Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video

Tomoya Sugihara; Shuntaro Masuda; Ling Xiao; Toshihiko Yamasaki

TL;DR

This work reframes video summarization as a language task by generating frame captions with an image-captioning model and synthesizing a text summary with an LLM, then learning to align frame-level captions to the summary in a semantic space via a diversity-aware loss. The core contribution is the Preserving Diversity Loss (PDL), which combines a margin-based ranking term with a sparsity regularizer and adaptively balances them based on video diversity to produce concise, diverse summaries. The method achieves state-of-the-art rank correlations on SumMe and competitive results on TVSum, while enabling personalized, user-guided summaries through prompts to the LLMs. These results demonstrate a scalable, self-supervised approach that leverages vision-language models for robust video summarization with potential for user-specific customization.

Abstract

Current video summarization methods rely heavily on supervised computer vision techniques, which demand time-consuming and subjective manual annotations. To overcome these limitations, we investigated self-supervised video summarization. Inspired by the success of Large Language Models (LLMs), we explored the feasibility of transforming the video summarization task into a Natural Language Processing (NLP) task. By leveraging the advantages of LLMs in context understanding, we aim to enhance the effectiveness of self-supervised video summarization. Our method begins by generating captions for individual video frames, which are then synthesized into text summaries by LLMs. Subsequently, we measure the semantic distance between the captions and the text summary. Notably, we propose a novel loss function to optimize our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are similar to the text summary. Our method achieves state-of-the-art performance on the SumMe dataset in rank correlation coefficients. In addition, our method offers a novel capability: personalized summarization.
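The final selection step described in the abstract (keeping frames whose captions are close to the text summary) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are hand-written toy vectors standing in for sentence embeddings, plain cosine similarity stands in for the learned semantic distance, and the helper names `score_frames` and `top_k_frames` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_frames(caption_embeddings, summary_embedding):
    """Frame-level scores: similarity of each caption embedding to the summary embedding."""
    return [cosine_similarity(e, summary_embedding) for e in caption_embeddings]


def top_k_frames(scores, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (assumption): in practice these would come from a sentence encoder.
summary_embedding = [1.0, 0.0, 1.0]
caption_embeddings = [
    [1.0, 0.0, 1.0],  # frame 0: same direction as the summary
    [0.0, 1.0, 0.0],  # frame 1: orthogonal to the summary
    [1.0, 0.0, 0.0],  # frame 2: partially aligned
    [0.5, 0.5, 0.5],  # frame 3: partially aligned
]

scores = score_frames(caption_embeddings, summary_embedding)
selected = top_k_frames(scores, k=2)
print(selected)
```

In this toy setting the two frames most aligned with the summary are kept. The sketch only conveys the idea of ranking frames by caption-to-summary similarity; it does not reproduce the trained model or its loss.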

Paper Structure (31 sections, 12 equations, 12 figures, 3 tables)

Introduction
Related Work
Video Summarization
Text Summarization
Method
Framework of our method
Loss function
Impact of the loss functions
Proposed PDL
Experiments
Datasets
Implementation Details
Evaluation Metrics
Results
Model Analysis
...and 16 more sections

Figures (12)

Figure 1: Overview of our proposed framework. We take only videos as input and first generate captions from individual frames using a pre-trained image captioning model, the Generative Image-to-text Transformer (GIT). The text summary is then created by GPT-4. The semantic distance between the individual captions and the text summary is calculated using the proposed Preserving Diversity Loss (PDL) to obtain frame-level scores. Finally, the frame-level scores are aggregated into scene-level scores, and the knapsack problem is solved to select a subset of scenes, thereby creating the video summary.
Figure 2: The pipeline of the proposed semantic distance calculation module. We solve a semantic textual similarity task between the individual frame captions and the generated text summary to calculate frame-level scores, using a Siamese Sentence-BERT architecture.
Figure 3: Effect of the fixed contribution value ($\alpha$) of the sparsity regularization loss. A higher value of Spearman's $\rho$ indicates a better video summary.
Figure 4: The histogram of $D$ for individual videos within the SumMe and TVSum datasets. A lower value indicates that the linguistically represented video has lower diversity. TVSum has more videos with higher $D$ scores than SumMe.
Figure 5: Examples of the transition of $s_{\rm{change}_k}$. The cosine similarity between adjacent scenes is shown. A lower value indicates a reduced similarity to the linguistically represented adjacent scene.
...and 7 more figures
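
The last stage in the Figure 1 caption (aggregating frame-level scores into scene-level scores and solving a knapsack problem over scenes) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: scene scores are taken as the mean of their frame scores, scene length in frames is used as the knapsack weight, and the scene boundaries, the budget, and the function names `scene_scores` and `knapsack_select` are all hypothetical.

```python
def scene_scores(frame_scores, scene_bounds):
    """Mean frame score per scene; scene_bounds holds half-open (start, end) frame ranges."""
    return [sum(frame_scores[s:e]) / (e - s) for s, e in scene_bounds]


def knapsack_select(values, weights, capacity):
    """0/1 knapsack by dynamic programming; returns the sorted indices of the chosen items."""
    n = len(values)
    # best[i][c] = best total value using the first i items within capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take item i-1
    # Backtrack to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return sorted(chosen)


# Toy inputs (assumptions): 10 frames split into 4 scenes, budget of 4 frames.
frame_scores = [0.9, 0.8, 0.1, 0.2, 0.1, 0.7, 0.6, 0.3, 0.3, 0.3]
scene_bounds = [(0, 2), (2, 5), (5, 7), (7, 10)]
budget = 4

values = scene_scores(frame_scores, scene_bounds)
weights = [e - s for s, e in scene_bounds]
selected_scenes = knapsack_select(values, weights, budget)
print(selected_scenes)
```

Here the two short, high-scoring scenes fit the budget together and are selected. How scene values and the length budget are defined in the actual method should be taken from the paper's Implementation Details rather than from this sketch.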
