Multimodal Abstractive Summarization for How2 Videos
Shruti Palaskar, Jindrich Libovický, Spandana Gella, Florian Metze
TL;DR
This work tackles open-domain video summarization by producing fluent textual summaries that fuse video content and speech transcripts. It proposes a multimodal abstractive framework with hierarchical attention to integrate text and video features and introduces Content F1 as a semantic, content-focused evaluation metric. Experiments on the How2 dataset show that multimodal (text+video) models outperform unimodal baselines, with transfer learning enabling cross-dataset gains. The results highlight the value of content-centric evaluation for teaser-style video summaries and point to future directions for end-to-end audio-based and multi-video summarization.
Abstract
In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information than to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities, and present pilot experiments on the How2 corpus of instructional videos. We also propose a new evaluation metric (Content F1) for the abstractive summarization task that measures the semantic adequacy of summaries rather than their fluency, which is already covered by metrics like ROUGE and BLEU.
