TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh, Joonseok Lee

TL;DR

This work proposes TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level, and introduces MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities.

Abstract

The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive experiments demonstrate that TripleSumm achieves state-of-the-art performance, outperforming existing methods by a significant margin on four benchmarks, including MoSu. Our code and dataset are available at https://github.com/smkim37/TripleSumm.
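To make the idea of frame-level adaptive weighting concrete, the following is a minimal sketch and not the authors' implementation. It assumes each modality supplies one unnormalized saliency score per frame, converts the three scores into softmax weights, and forms a weighted sum of the modality features. The names `fuse_frame`, `frame_features`, and `text_heavy`, as well as the toy vectors, are hypothetical; the actual model relies on fusion tokens and attention blocks as described in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_frame(features, logits):
    """Fuse per-modality feature vectors for a single frame.

    features: dict mapping modality name -> list of floats (equal lengths).
    logits:   dict mapping modality name -> unnormalized saliency score.
    Both layouts are assumptions made for this illustration.
    Returns (fused_vector, weights_by_modality).
    """
    modalities = sorted(features)
    weights = softmax([logits[name] for name in modalities])
    dim = len(features[modalities[0]])
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, modalities))
        for i in range(dim)
    ]
    return fused, dict(zip(modalities, weights))


# Toy frame in which the text modality is assumed to be most salient.
frame_features = {"visual": [1.0, 0.0], "text": [0.0, 1.0], "audio": [0.5, 0.5]}
text_heavy = {"visual": 0.0, "text": 2.0, "audio": 0.0}
fused, weights = fuse_frame(frame_features, text_heavy)
```

Because the weights are recomputed for every frame, a different frame with, say, high visual and audio scores would yield a fused vector dominated by those modalities instead.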

Paper Structure (42 sections, 6 equations, 9 figures, 19 tables)

Figures (9)

  • Figure 1: Illustration of Dynamic Importance of Modalities. Text is most salient at (a), while visual and audio are dominant at (b). At (c), all three contribute significantly. This highlights the need for an adaptive model that dynamically weighs the saliency of each modality frame by frame.
  • Figure 2: Overall architecture of TripleSumm. Visual, text, and audio features are first encoded and linearly projected, then aggregated into fusion tokens, refined through the Multi-scale Temporal block (MST, lower left), and fused in the Cross-modal Fusion block (CMF, lower right). The fused representation is passed through a prediction head to generate frame-level importance scores.
  • Figure 3: Qualitative Examples on MoSu. The graph in the middle visualizes the fusion token's attention weights, illustrating how our model dynamically estimates the saliency of each modality and thus maintains strong summarization accuracy even when some modalities are missing.
  • Figure II: Detailed Statistics of the MoSu Dataset. Transcript density is the average fraction of video duration covered by valid text.
  • Figure II: Performance evaluation by thresholding modality attention weights. The figure shows model performance, measured by (a) Kendall's $\tau$ and (b) Spearman's $\rho$, when a threshold $\theta$ is applied to the fusion token's learned modality attention weights. The Over $\theta$ line shows performance when only weights $\geq \theta$ are retained, while the Under $\theta$ line shows performance when only weights $\leq \theta$ are retained.
  • ...and 4 more figures
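
The thresholding caption above mentions Kendall's $\tau$, Spearman's $\rho$, and retaining weights over or under a threshold $\theta$. The sketch below illustrates these quantities under stated assumptions: it uses the simple tie-free forms of both rank correlations, and it treats "retaining" a weight as zeroing the others, which is a guess since the listing does not specify how the paper applies the retained weights. All function names and the toy scores are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs.
    Assumes no ties; tied pairs would simply contribute zero."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            sx = (x[i] > x[j]) - (x[i] < x[j])
            sy = (y[i] > y[j]) - (y[i] < y[j])
            total += sx * sy
    return total / (n * (n - 1) / 2)


def ranks(values):
    """1-based ranks in ascending order, assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the tie-free formula 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def retain_over(weights, theta):
    """Keep weights >= theta and zero the rest (assumed masking rule)."""
    return [w if w >= theta else 0.0 for w in weights]


def retain_under(weights, theta):
    """Keep weights <= theta and zero the rest (assumed masking rule)."""
    return [w if w <= theta else 0.0 for w in weights]


# Toy predicted vs. reference frame-importance scores.
predicted = [0.1, 0.4, 0.3, 0.9]
reference = [0.2, 0.5, 0.6, 0.8]
tau = kendall_tau(predicted, reference)
rho = spearman_rho(predicted, reference)
over = retain_over([0.2, 0.7, 0.1], 0.5)
under = retain_under([0.2, 0.7, 0.1], 0.5)
```

For real evaluations with tied scores, a library routine that handles ties would be a safer choice than these simplified formulas.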