Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video
Tomoya Sugihara, Shuntaro Masuda, Ling Xiao, Toshihiko Yamasaki
TL;DR
This work reframes video summarization as a language task by generating frame captions with an image-captioning model and synthesizing a text summary with an LLM, then learning to align frame-level captions to the summary in a semantic space via a diversity-aware loss. The core contribution is the Preserving Diversity Loss (PDL), which combines a margin-based ranking term with a sparsity regularizer and adaptively balances them based on video diversity to produce concise, diverse summaries. The method achieves state-of-the-art rank correlations on SumMe and competitive results on TVSum, while enabling personalized, user-guided summaries through prompts to the LLMs. These results demonstrate a scalable, self-supervised approach that leverages vision-language models for robust video summarization with potential for user-specific customization.
Abstract
Current video summarization methods rely heavily on supervised computer vision techniques, which demand time-consuming and subjective manual annotations. To overcome these limitations, we investigated self-supervised video summarization. Inspired by the success of Large Language Models (LLMs), we explored the feasibility of transforming the video summarization task into a Natural Language Processing (NLP) task. By leveraging the strengths of LLMs in context understanding, we aim to enhance the effectiveness of self-supervised video summarization. Our method begins by generating captions for individual video frames, which are then synthesized into a text summary by an LLM. Subsequently, we measure the semantic distance between each caption and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are similar to the text summary. Our method achieves state-of-the-art rank correlation coefficients on the SumMe dataset. In addition, our method enables personalized summarization, a novel feature.
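The final selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a learned text encoder, the captions and summary are invented, and `select_frames` and its `budget` parameter are hypothetical names introduced only for this example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned text encoder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(captions, summary, budget):
    """Score each frame caption against the summary and keep the top `budget` frames.

    Returns the selected frame indices in temporal order, plus all scores.
    """
    summary_vec = embed(summary)
    scores = [cosine_similarity(embed(c), summary_vec) for c in captions]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget]), scores


# Invented example data: one caption per frame and an LLM-style text summary.
captions = [
    "a man walks a dog in the park",
    "a close up of grass",
    "a dog catches a frisbee in the park",
    "a parked car on the street",
]
summary = "a man plays frisbee with a dog in the park"

selected, scores = select_frames(captions, summary, budget=2)
print(selected)  # indices of the frames kept for the summary
```

In this toy setting the frames whose captions share the most content with the summary are kept, and the off-topic frames are dropped. The sketch covers only similarity-based selection; it does not model the proposed diversity-aware loss or any training.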
