VideoXum: Cross-modal Visual and Textural Summarization of Videos

Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo

TL;DR

This work introduces a new cross-modal video summarization task that jointly produces a short video clip and a text narrative from a long video. It presents VideoXum, a large-scale reannotation of ActivityNet Captions with 14,001 long videos and 140,010 aligned video-text summary pairs, and proposes VTSUM-BLIP, an end-to-end framework with a hierarchical video encoder and dual decoders to generate synchronized visual and textual summaries. A new cross-modal evaluation metric, VT-CLIPScore, assesses semantic coherence between modalities, and experiments show the approach achieves strong performance on VideoXum and competitive results on existing single-modal datasets. The work provides a benchmark and baseline for future research in cross-modal video summarization and highlights directions for improving cross-modal coherence and evaluation metrics.

Abstract

Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip and the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated video clip and text narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BLIP -- to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
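
The abstract does not define VT-CLIPScore. As a rough illustration of what a CLIP-style cross-modal consistency check might compute, the sketch below averages the cosine similarity between the frame embeddings of a video summary and the embedding of its text summary. The function names, the toy vectors, and the mean-cosine form are all assumptions made for illustration, not the paper's formulation; in practice the embeddings would come from a pretrained vision-language encoder.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_frame_text_similarity(
    frame_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
) -> float:
    """Hypothetical consistency score: mean cosine similarity between each
    summary-frame embedding and the text-summary embedding.
    This is an assumed form, not the paper's VT-CLIPScore definition."""
    if not frame_embs:
        raise ValueError("need at least one frame embedding")
    return sum(cosine(f, text_emb) for f in frame_embs) / len(frame_embs)


# Toy 2-D embeddings: one frame aligned with the text, one orthogonal to it.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = mean_frame_text_similarity(frames, text)
```

A higher value under this assumed form means the selected frames and the text narrative sit closer together in the shared embedding space.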

Paper Structure (21 sections, 6 equations, 7 figures, 8 tables)

This paper contains 21 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of the V2X-SUM tasks. A full-length source video (bottom) can be summarized into a shortened video and a text narrative (top). This task requires semantic alignment between the video and text summaries.
  • Figure 2: Statistical information of VideoXum dataset: (a) distribution of video length; (b) distribution of video length compression ratio; (c) distribution of normalized center timestamp; (d) distribution of length of text summary.
  • Figure 3: An overview of our VTSUM-BLIP framework (left). It consists of a hierarchical video encoder (middle), video-sum decoder, and text-sum decoder (right). For V2V-SUM, the video-sum decoder employs a temporal Transformer and local self-attention module to aggregate the local context. For V2T-SUM, the text-sum decoder is a pretrained BLIP text decoder.
  • Figure 4: Two example results of the generated video and text summaries across different baseline models. Red (both line and box) indicates the results of the ground truth. Green indicates the results of the VTSUM-BLIP (Base). Blue indicates the results of VTSUM-BLIP (+TT+CA).
  • Figure 5: Impact of local window size $\varepsilon$ for Context Aggregation (CA) module.
  • ...and 2 more figures
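
The captions for Figures 3 and 5 mention a local self-attention module with a local window size $\varepsilon$. One common way to restrict attention to a local temporal window is a banded boolean mask, sketched below. The symmetric window, the reading of $\varepsilon$ as a number of frames on each side, and the function name are assumptions; the captions do not specify how the paper implements it.

```python
from typing import List


def local_attention_mask(num_frames: int, window: int) -> List[List[bool]]:
    """Build a banded mask where frame i may attend to frame j only if
    |i - j| <= window. The symmetric window is an assumed convention."""
    if num_frames < 0 or window < 0:
        raise ValueError("num_frames and window must be non-negative")
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Four frames, each attending to itself and one neighbor on either side.
mask = local_attention_mask(4, 1)
```

With `window=0` each frame attends only to itself, and with `window >= num_frames - 1` the mask reduces to full (global) attention, so the window size trades local context against coverage.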