
SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization

Sicheng Liu, Lintao Wang, Xiaogang Zhu, Xuequan Lu, Zhiyong Wang, Kun Hu

TL;DR

SITransformer addresses extreme multimodal summarization by extracting cross-modal shared salient information and using it to suppress topic-irrelevant content. The architecture combines a Noise Filter with Differentiable Top-k (NFDT) and a gating unit, CLIP-based multimodal embedding, and cross-modal attention to generate a one-sentence text summary and a cover frame. It is trained with Wasserstein-based distribution losses and a fluency term. The paper reports state-of-the-art results on BBC News video-document data for both text and video outputs, and its ablations support the importance of shared information and the NFDT gating mechanism. The work indicates that explicitly grounding the summary in information shared across modalities can improve extreme multimodal summarization quality; the authors state that their implementation will be released publicly.

Abstract

Extreme Multimodal Summarization with Multimodal Output (XMSMO) has become an attractive summarization approach because it integrates various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the fact that multimodal data often contains topic-irrelevant information, which can mislead the model into producing inaccurate summaries, especially extremely short ones. In this paper, we propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization. It has a shared-information-guided pipeline that consists of a cross-modal shared information extractor and a cross-modal interaction module. The extractor derives semantically shared salient information from the different modalities through a novel filtering process that combines a differentiable top-k selector with a shared-information-guided gating unit. As a result, the common, salient, and relevant content across modalities is identified. Next, a transformer with cross-modal attention performs intra- and inter-modality learning under the guidance of the shared information to produce the extreme summary. Comprehensive experiments demonstrate that SITransformer significantly enhances summarization quality for both the video and text summaries in XMSMO. Our code will be publicly available at https://github.com/SichengLeoLiu/MMAsia24-XMSMO.
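To make the filtering idea in the abstract more concrete, the following is a minimal sketch of a relaxed top-k selection followed by a gate. It is not the paper's NFDT: the function names (`soft_top_k_mask`, `shared_info_filter`), the cosine-similarity scoring against a mean-pooled video vector, the sigmoid relaxation around a threshold, and the dot-product gate are all assumptions chosen for illustration. It is written in NumPy for readability, so no gradients are actually computed here; the point is only that every step after the threshold is a smooth function of the scores.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_top_k_mask(scores, k, temperature=0.1):
    """Relaxed top-k: weights near 1 for the k largest scores, near 0 otherwise.

    Assumed relaxation (not taken from the paper): a sigmoid centred on the
    midpoint between the k-th and (k+1)-th largest scores.
    """
    if not 0 < k < scores.shape[0]:
        raise ValueError("k must satisfy 0 < k < number of scores")
    sorted_scores = np.sort(scores)[::-1]  # descending order
    threshold = 0.5 * (sorted_scores[k - 1] + sorted_scores[k])
    return sigmoid((scores - threshold) / temperature)


def shared_info_filter(text_feats, video_feats, k, temperature=0.1):
    """Softly keep the k text tokens most similar to the pooled video features.

    text_feats:  (num_tokens, dim) array of token embeddings.
    video_feats: (num_frames, dim) array of frame embeddings.
    Returns the soft mask, a pooled "shared" vector, and the gated tokens.
    """
    # Cosine similarity between each token and the mean video embedding.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = video_feats.mean(axis=0)
    v = v / np.linalg.norm(v)
    scores = t @ v

    mask = soft_top_k_mask(scores, k, temperature)

    # Mask-weighted average of the tokens: a stand-in for shared information.
    shared = (mask[:, None] * text_feats).sum(axis=0) / mask.sum()

    # Hypothetical gate: scale each token by its agreement with `shared`.
    gate = sigmoid(text_feats @ shared / np.sqrt(text_feats.shape[1]))
    gated = gate[:, None] * text_feats
    return mask, shared, gated


if __name__ == "__main__":
    text = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
    video = np.array([[1.0, 0.0, 0.0],
                      [1.0, 0.05, 0.0]])
    mask, shared, gated = shared_info_filter(text, video, k=2)
    print(np.round(mask, 3))
```

In this toy input the first two tokens point in roughly the same direction as the video frames, so the soft mask gives them most of the weight. Lowering `temperature` pushes the mask toward a hard top-k selection, while raising it spreads weight more evenly across tokens.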

Paper Structure (18 sections, 20 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: An example of irrelevant noise (in the red box) and shared salient information (in the green box).
  • Figure 2: The overall structure of the SITransformer, which consists of a multimodal embedding module, a cross-modal shared information extractor, a cross-modal interaction module and modality-specific extreme summarization decoders.
  • Figure 3: Noise Filter with Differentiable Top-k (NFDT).
  • Figure 4: Qualitative examples produced by our NFDT and the state-of-the-art methods TopicCAT (Tang et al., 2023), TLDW (Tang et al., 2023), and CLIP (Radford et al., 2021).
  • Figure 5: Impact of the amount of shared salient information.