Personalized Video Summarization by Multimodal Video Understanding

Brian Chen, Xiangyuan Zhao, Yingnan Zhu

TL;DR

This work proposes a new benchmark for video summarization that captures various user preferences and presents a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models to avoid the need to train a video summarization system on a large training dataset.

Abstract

Video summarization techniques have been shown to improve the overall user experience of accessing and comprehending video content. When the user's preference is known, video summarization can identify significant or relevant content in an input video, helping users obtain the information they need or decide whether to watch the original video. Adapting video summarization to various types of video and user preferences requires significant training data and expensive human labeling. To facilitate such research, we propose a new benchmark for video summarization that captures various user preferences. We also present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization that is based on pre-trained visual language models (VLMs), avoiding the need to train a video summarization system on a large training dataset. The pipeline takes both video and closed captioning as input and performs semantic analysis at the scene level by converting video frames into text. The user's genre preference is then used as the basis for selecting the pertinent textual scenes. The experimental results demonstrate that our proposed pipeline outperforms current state-of-the-art unsupervised video summarization models. We also show that our method is more adaptable across different datasets than supervised query-based video summarization models. Finally, the runtime analysis demonstrates that our pipeline is more suitable for practical use when scaling up the number of user preferences and videos.

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall structure of VSL. VSL utilizes a captioning model to transform long videos into text and summarize the text to represent the video.
  • Figure 2: Dataset creation pipeline. (a) We employ CLIP features to compute the similarity between each genre and each frame in a zero-shot manner, and then aggregate the results to obtain the distribution of genre labels at the scene level. (b) We filter the genre labels based on the ground-truth (GT) movie genre annotations and a threshold on the length of the genre-specific summarization. (c) We select the top 15% of scenes with the highest confidence score for a particular genre to include in the video summarization. (d) For each genre query, we create a video summarization that focuses on scenes belonging to that genre. (e) In the multi-genre summarization, we combine the confidence scores across different genres to determine the overall confidence score.
  • Figure 3: The detailed structure for VSL consists of three main steps. (a) In the first step, the input movie's audio and video backbones undergo processing through multimodal scene detection. (b) In the second step, semantic analysis is performed, which generates a score for each scene based on the results of multimodal scene detection. (c) The final step involves video summarization, where the scenes with the highest scores are selected to generate the summary video.
  • Figure 4: The process of multimodal scene detection involves the use of separate algorithms for video and audio backbones. These backbones analyze the input data independently to detect scenes. The resulting scene detection outcomes from each backbone are then aligned and combined at the frame level to obtain the final multimodal scene detection results.
  • Figure 5: Runtime analysis with respect to the number of videos and preferences.
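
The scene-selection steps described in the Figure 2 caption (aggregating frame-level genre scores to scenes, keeping the top 15% of scenes, and combining scores across genres) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of a plain mean for both the frame-to-scene aggregation and the multi-genre combination, and the toy scores are all assumptions. In the paper the per-frame scores come from CLIP similarities; here they are hard-coded numbers.

```python
from math import ceil


def scene_scores(frame_scores, scene_bounds):
    """Average per-frame genre scores within each scene.

    frame_scores: one similarity score per frame (hypothetical values here).
    scene_bounds: list of (start, end) frame indices, end exclusive.
    Mean aggregation is an assumption; the caption only says "aggregate".
    """
    return [sum(frame_scores[s:e]) / (e - s) for s, e in scene_bounds]


def top_fraction(scores, fraction=0.15):
    """Indices of the highest-scoring scenes, returned in temporal order."""
    k = max(1, ceil(len(scores) * fraction))  # always keep at least one scene
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def combine_genres(per_genre_scores):
    """Combine scene scores across genres by averaging (an assumption)."""
    n_genres = len(per_genre_scores)
    return [sum(col) / n_genres for col in zip(*per_genre_scores)]


# Toy example: 8 frames grouped into 4 scenes of 2 frames each.
scene_bounds = [(0, 2), (2, 4), (4, 6), (6, 8)]
frame_scores = {
    "action": [0.9, 0.8, 0.1, 0.2, 0.7, 0.9, 0.1, 0.1],
    "comedy": [0.3, 0.3, 0.8, 0.9, 0.2, 0.1, 0.3, 0.2],
}

action = scene_scores(frame_scores["action"], scene_bounds)
comedy = scene_scores(frame_scores["comedy"], scene_bounds)

single_genre_summary = top_fraction(action)  # default 15% cut
wider_summary = top_fraction(action, fraction=0.5)
multi_genre_summary = top_fraction(combine_genres([action, comedy]), fraction=0.5)
```

With only four scenes, the 15% cut keeps a single scene, so the example also uses a 50% cut to make the ranking visible. Returning the selected indices in temporal order keeps the summary's scenes in their original sequence.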