
UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Yuting Mei, Linli Yao, Qin Jin

TL;DR

The paper tackles BiSSV, a bimodal video summarization task that jointly produces a concise TM-Summary and an informative VM-Summary. It introduces the BIDS dataset and a unified end-to-end framework, UBiSS, which uses a saliency-aware encoder and a ranking-based objective to learn saliency across both modalities. A new joint metric, $NDCG_{MS}$, evaluates bimodal outputs by weighting salient segments, and experiments show UBiSS outperforms multi-stage and unimodal baselines while human studies confirm improved satisfaction and informativeness. Overall, the work advances multimodal video understanding by tightly coupling text and visual summaries and providing a scalable dataset and robust evaluation paradigm for BiSSV.

Abstract

With the surge in the amount of video data, video summarization techniques, including visual-modal (VM) and textual-modal (TM) summarization, are attracting increasing attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework, UBiSS, for the BiSSV task, which models the saliency information in the video and generates a TM-Summary and VM-Summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.


Paper Structure

This paper contains 22 sections, 2 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Illustration of the Bimodal Semantic Summarization of Videos (BiSSV) task, which generates video summaries in both textual-modality and visual-modality.
  • Figure 2: Illustration of the data processing procedure comprising data merging, VM-Summary extraction, and data cleaning. Each step is discussed in the paper's data processing section. The VM-Summary extraction algorithm is presented on the right side with colored numbers representing different scaling conditions: (1) Both adjacent segments are selected. (2) Only one adjacent segment is selected. (3) No adjacent segment is selected.
  • Figure 3: (a) Distribution of duration ratio between VM-Summary and original video; (b) Distribution of temporal positions of the segments selected into the VM-Summary in the original video.
  • Figure 4: Model Architecture of UBiSS.
  • Figure 5: Comparison of metrics for ranking similarity evaluation (Kendall, 1945; Zwillinger, 1999; Järvelin & Kekäläinen, 2002). Prediction A makes an incorrect prediction on the two highest-scored segments, while Prediction B makes an incorrect prediction on the two lowest-scored segments. Since Kendall's $\tau$ and Spearman's $\rho$ treat all segments equally while assessing ranking similarity, they both favor Prediction A. However, the incorrect prediction of A results in an inaccurate VM-Summary. Our proposed metric, NDCG@15%, prioritizes the ranking similarity of the most salient segments and favors Prediction B more.
  • ...and 4 more figures
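
The intuition behind the Figure 5 comparison can be illustrated with a toy NDCG@k computation. This is a generic sketch of the standard NDCG metric restricted to the top 15% of segments, not the paper's exact implementation, and all saliency scores below are hypothetical:

```python
import math

def ndcg_at_k(true_scores, pred_scores, k):
    """NDCG over the top-k segments ranked by predicted saliency.

    true_scores: ground-truth saliency per segment
    pred_scores: predicted saliency per segment
    """
    # Rank segments by predicted score, keep the top k, and accumulate
    # their ground-truth gains discounted by log2 of the rank position.
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    dcg = sum(true_scores[i] / math.log2(r + 2) for r, i in enumerate(order[:k]))
    # Ideal DCG: the same discounting applied to the best possible ranking.
    ideal = sorted(true_scores, reverse=True)
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example in the spirit of Figure 5 (hypothetical scores, 10 segments):
truth  = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
pred_a = [1, 2, 10, 9, 8, 7, 6, 5, 4, 3]   # errors on the two highest-scored segments
pred_b = [10, 9, 8, 7, 6, 5, 4, 3, 1, 2]   # errors on the two lowest-scored segments

k = max(1, round(0.15 * len(truth)))  # top 15% of segments
```

Because only the top-ranked 15% of segments contribute to the score, Prediction B (which ranks the most salient segments correctly) attains a higher NDCG@15% than Prediction A, matching the preference the caption describes.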