
UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Yuting Mei, Linli Yao, Qin Jin

TL;DR

The paper tackles BiSSV, a bimodal video summarization task that jointly produces a concise TM-Summary and an informative VM-Summary. It introduces the BIDS dataset and a unified end-to-end framework, UBiSS, which uses a saliency-aware encoder and a ranking-based objective to learn saliency across both modalities. A new joint metric, $NDCG_{MS}$, evaluates bimodal outputs by weighting salient segments, and experiments show UBiSS outperforms multi-stage and unimodal baselines while human studies confirm improved satisfaction and informativeness. Overall, the work advances multimodal video understanding by tightly coupling text and visual summaries and providing a scalable dataset and robust evaluation paradigm for BiSSV.

Abstract

With the surge in the amount of video data, video summarization techniques, including visual-modal (VM) and textual-modal (TM) summarization, are attracting increasing attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework, UBiSS, for the BiSSV task, which models the saliency information in the video and generates a TM-Summary and VM-Summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.


Paper Structure

This paper contains 22 sections, 2 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Illustration of the Bimodal Semantic Summarization of Videos (BiSSV) task, which generates video summaries in both textual-modality and visual-modality.
  • Figure 2: Illustration of the data processing procedure comprising data merging, VM-Summary extraction, and data cleaning. Each step is discussed in the paper's data processing section. The VM-Summary extraction algorithm is presented on the right side with colored numbers representing different scaling conditions: (1) Both adjacent segments are selected. (2) Only one adjacent segment is selected. (3) No adjacent segment is selected.
  • Figure 3: (a) Distribution of duration ratio between VM-Summary and original video; (b) Distribution of temporal positions of the segments selected into the VM-Summary in the original video.
  • Figure 4: Model Architecture of UBiSS.
  • Figure 5: Comparison of metrics for ranking similarity evaluation (Kendall, 1945; Zwillinger, 1999; Järvelin & Kekäläinen, 2002). Prediction A makes an incorrect prediction on the two highest-scored segments, while Prediction B makes an incorrect prediction on the two lowest-scored segments. Since Kendall's $\tau$ and Spearman's $\rho$ treat all segments equally while assessing ranking similarity, they both favor Prediction A. However, the incorrect prediction of A results in an inaccurate VM-Summary. Our proposed metric, NDCG@15%, prioritizes the ranking similarity of the most salient segments and favors Prediction B more.
  • ...and 4 more figures
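
The intuition behind the Figure 5 comparison can be illustrated with a toy NDCG@k computation. This is a generic sketch of the standard NDCG metric restricted to the top 15% of segments, not the paper's exact implementation, and all saliency scores below are hypothetical:

```python
import math

def ndcg_at_k(true_scores, pred_scores, k):
    """NDCG over the top-k segments ranked by predicted saliency.

    true_scores: ground-truth saliency per segment
    pred_scores: predicted saliency per segment
    """
    # Rank segments by predicted score, keep the top k, and accumulate
    # their ground-truth gains discounted by log2 of the rank position.
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    dcg = sum(true_scores[i] / math.log2(r + 2) for r, i in enumerate(order[:k]))
    # Ideal DCG: the same discounting applied to the best possible ranking.
    ideal = sorted(true_scores, reverse=True)
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example in the spirit of Figure 5 (hypothetical scores, 10 segments):
truth  = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
pred_a = [1, 2, 10, 9, 8, 7, 6, 5, 4, 3]   # errors on the two highest-scored segments
pred_b = [10, 9, 8, 7, 6, 5, 4, 3, 1, 2]   # errors on the two lowest-scored segments

k = max(1, round(0.15 * len(truth)))  # top 15% of segments
```

Because only the top-ranked 15% of segments contribute to the score, Prediction B (which ranks the most salient segments correctly) attains a higher NDCG@15% than Prediction A, matching the preference the caption describes.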