SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

Manolis Mylonas, Charalampia Zerva, Evlampios Apostolidis, Vasileios Mezaris

TL;DR

This work tackles script-driven video summarization by considering both the visual content of a video and its spoken transcript, guided by a user-provided script. It introduces SD-MVSum, which uses two weighted cross-modal attention modules to fuse the script with the visual and the transcript content, followed by a Transformer-based scorer that produces frame-level importance scores. The authors extend two large-scale datasets, S-VideoXum and MrHiSum, to support multimodal training and evaluation. Experimental comparisons show that SD-MVSum is competitive with state-of-the-art script-driven and generic summarization methods on both datasets, supporting the use of multimodal fusion and similarity-based attention weighting for script-driven video summarization.

Abstract

In this work, we extend a recent method for script-driven video summarization, originally considering just the visual content of the video, to take into account the relevance of the user-provided script also with the video's spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for video summarization (S-VideoXum, MrHiSum), to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
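
The weighted cross-modal attention idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the exact weighting rule is not given here, so the sketch assumes that frame embeddings act as queries, script sentence embeddings act as keys and values, no learned projections are used, and the softmax attention weights are scaled by the cosine similarity mapped to [0, 1]. The function names (`cosine_similarity`, `softmax`, `weighted_cross_attention`) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_attention(frames, script):
    """Fuse each frame embedding with the script sentence embeddings.

    Assumptions (illustrative only): frames are queries, script sentences
    are keys and values, and each attention weight is multiplied by
    (1 + cosine similarity) / 2 so that frames unrelated to the script
    yield an attenuated output.
    """
    d = len(script[0])
    fused = []
    for f in frames:
        # Scaled dot-product scores between this frame and every sentence.
        scores = [sum(a * b for a, b in zip(f, s)) / math.sqrt(d) for s in script]
        attn = softmax(scores)
        # Similarity-based scaling factors in [0, 1].
        sims = [(1.0 + cosine_similarity(f, s)) / 2.0 for s in script]
        weights = [a * c for a, c in zip(attn, sims)]
        # Weighted sum of the script embeddings (the values).
        fused.append([sum(w * s[k] for w, s in zip(weights, script)) for k in range(d)])
    return fused


script = [[1.0, 0.0]]
frames = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0]]
fused = weighted_cross_attention(frames, script)
```

In this toy input, the frame aligned with the script keeps the full script vector, the orthogonal frame keeps half of it, and the opposed frame is suppressed entirely. In the described architecture, a second module of the same kind would pair the script with the transcript, and the two outputs would be concatenated before scoring.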

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the SD-MVSum network architecture. Given an input video, a user script about the content of the summary, and a set of audio transcripts, SD-MVSum produces a video summary by finding associations of the user script with both the visual and the spoken content in the video, using two weighted cross-modal attention mechanisms. The outputs of these mechanisms are concatenated and forwarded to a trainable Transformer-based scorer which computes frame-level importance scores. These scores are used by a frame/fragment selection component that forms the video summary given a video fragmentation and a time-budget about the summary duration.
  • Figure 2: The processing pipeline in the weighted cross-modal attention mechanism when fusing the visual and the script embeddings. The dynamic scaling of the attention weights is performed based on the computed cosine similarity matrix of the input embeddings.
  • Figure 3: Overview of the processing pipeline for creating the S-MrHiSum dataset.
  • Figure 4: An indicative sample from our qualitative analysis. The upper part provides a keyframe-based representation of the content of the full-length video, and the tabular structure beneath shows the utilized input data and the generated video summary by each method.
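
The final step in the Figure 1 caption, forming a summary from frame-level scores given a video fragmentation and a time budget, can be sketched as follows. The caption does not say how fragments are chosen, so this sketch assumes a 0/1 knapsack over fragments, with each fragment valued by the mean score of its frames, a formulation commonly used in video summarization. The function name `select_fragments` and the toy inputs are hypothetical.

```python
def select_fragments(frame_scores, fragments, budget):
    """Pick fragments that maximize total value within a frame budget.

    frame_scores: per-frame importance scores.
    fragments: list of (start, end) frame index pairs, end exclusive.
    budget: maximum total number of frames in the summary.
    Assumption: a fragment's value is the mean score of its frames.
    Returns the sorted indices of the selected fragments.
    """
    lengths = [end - start for start, end in fragments]
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in fragments]
    n = len(fragments)

    # best[i][c]: best total value using the first i fragments within capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if lengths[i - 1] <= c:
                candidate = best[i - 1][c - lengths[i - 1]] + values[i - 1]
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Backtrack to recover which fragments were taken.
    chosen = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


frame_scores = [0.9, 0.9, 0.1, 0.1, 0.1, 0.8, 0.8, 0.2]
fragments = [(0, 2), (2, 5), (5, 7), (7, 8)]
selected = select_fragments(frame_scores, fragments, budget=4)
```

With a budget of 4 frames, the two high-scoring two-frame fragments are selected and the low-scoring ones are left out. A greedy selection by score would be a simpler alternative, at the cost of possibly leaving part of the budget unused.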