Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization
Yanghai Zhang, Ye Liu, Shiwei Wu, Kai Zhang, Xukai Liu, Qi Liu, Enhong Chen
TL;DR
The paper addresses cross-modality correlation in Multimodal Summarization with Multimodal Output (MSMO) by leveraging entity information. It proposes EGMS, a BART-based framework with three parts: a Shared Multimodal Encoder that jointly processes text, images, and entity cues via a Text-Image Encoder and an Entity-Image Encoder; a Multimodal Guided Decoder; and a Gated Knowledge Distillation component for image selection guided by a CLIP-based teacher. Entities are embedded using knowledge-graph embeddings (e.g., TransE), and the visual features from the two encoders are fused through a gating mechanism to produce coherent textual summaries and relevant image selections. Experiments on the MSMO dataset show state-of-the-art performance, and ablation studies validate the necessity of incorporating entity information for improved cross-modality understanding and summary quality.
Abstract
The rapid increase in multimedia data has spurred advancements in Multimodal Summarization with Multimodal Output (MSMO), which aims to produce a multimodal summary that integrates both text and relevant images. The inherent heterogeneity of content within multimodal inputs and outputs presents a significant challenge to the execution of MSMO. Traditional approaches typically adopt a holistic perspective on coarse image-text data or individual visual objects, overlooking the essential connections between objects and the entities they represent. To integrate fine-grained entity knowledge, we propose an Entity-Guided Multimodal Summarization model (EGMS). Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently. A gating mechanism then combines visual data for enhanced textual summary generation, while image selection is refined through knowledge distillation from a pre-trained vision-language model. Extensive experiments on a public MSMO dataset validate the superiority of EGMS and demonstrate the necessity of incorporating entity information into the MSMO problem.
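The two mechanisms named in the abstract, gated combination of visual features and distillation from a teacher's image scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the element-wise sigmoid gate, the temperature-scaled softmax, the KL-divergence loss, and all function and parameter names below (`gated_fusion`, `distillation_loss`, `weights`, `bias`, `temperature`) are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_image_feat, entity_image_feat, weights, bias):
    """Blend two visual feature vectors with an element-wise gate.

    Assumed form (not taken from the paper):
        g = sigmoid(W [v_t ; v_e] + b),  fused = g * v_t + (1 - g) * v_e
    `weights` has one row per output dimension, each of length 2 * d.
    """
    concat = text_image_feat + entity_image_feat  # list concatenation
    fused = []
    for row, b, t, e in zip(weights, bias, text_image_feat, entity_image_feat):
        g = sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        fused.append(g * t + (1.0 - g) * e)
    return fused


def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over per-image relevance scores.

    The teacher scores stand in for similarities produced by a pre-trained
    vision-language model; how they are computed is outside this sketch.
    """
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# With zero weights and bias the gate is 0.5, so the result is a plain average.
fused = gated_fusion([1.0, 3.0], [3.0, 5.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
loss = distillation_loss([2.0, 0.5, 0.1], [0.3, 1.5, 0.2])
```

In this sketch the gate decides, per feature dimension, how much of the text-image representation versus the entity-image representation is passed on, and the loss shrinks to zero as the student's ranking of images matches the teacher's.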
