Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Théo Gigant; Camille Guinaudeau; Frédéric Dufaux

TL;DR

This work tackles the problem of summarizing long, text-heavy multimodal presentations with Vision-Language Models, examining how input representation and structure affect cost and performance. It builds an 822-presentation benchmark from the TIB dataset and evaluates open-weight models (notably Qwen2-VL) on unimodal and multimodal inputs, with a focus on interleaved slides-transcript representations and visual token budgets. Key contributions include a comprehensive dataset and benchmark, a fine-grained analysis of how input representations affect extractive statistics and relevance, and practical guidance showing that structured interleaved inputs extend the input-length Pareto frontier beyond slides alone. The study also discusses cross-modal interactions and the limitations of current VLMs in handling conflicts between modalities, and it suggests directions for richer training data to improve robustness and trust in multimodal summarization systems.

Abstract

Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be used advantageously as input in place of the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions for improving the ability of VLMs to understand documents of this nature.
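
To make the idea of a structured, interleaved slides-transcript input more concrete, the following is a minimal sketch of one way such a representation could be assembled. It is not the authors' pipeline: the `Slide` and `Utterance` types, the `interleave` function, the assumption that each slide and transcript segment carries a start timestamp, and the generic list-of-typed-items output are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Slide:
    start: float      # assumed: time (seconds) at which the slide first appears
    image_path: str   # assumed: path to the extracted slide image


@dataclass
class Utterance:
    start: float      # assumed: time (seconds) at which the segment is spoken
    text: str


def interleave(slides: List[Slide], utterances: List[Utterance]) -> List[Dict[str, str]]:
    """Place each slide before the transcript spoken while it is on screen."""
    slides = sorted(slides, key=lambda s: s.start)
    utterances = sorted(utterances, key=lambda u: u.start)
    # Each slide is shown until the next one starts; the last one until the end.
    boundaries = [s.start for s in slides] + [float("inf")]

    items: List[Dict[str, str]] = []
    i = 0  # index of the next transcript segment not yet emitted

    def take_until(end: float) -> str:
        nonlocal i
        chunk = []
        while i < len(utterances) and utterances[i].start < end:
            chunk.append(utterances[i].text)
            i += 1
        return " ".join(chunk)

    # Speech that precedes the first slide (or all speech if there are no slides).
    intro = take_until(boundaries[0])
    if intro:
        items.append({"type": "text", "text": intro})

    for k, slide in enumerate(slides):
        items.append({"type": "image", "image": slide.image_path})
        spoken = take_until(boundaries[k + 1])
        if spoken:
            items.append({"type": "text", "text": spoken})
    return items


example = interleave(
    [Slide(0.0, "slide_1.png"), Slide(10.0, "slide_2.png")],
    [Utterance(1.0, "Welcome."), Utterance(5.0, "Here is the outline."),
     Utterance(12.0, "Now the method.")],
)
```

The resulting list alternates slide images with the text spoken over them. How such items are then serialized into a prompt depends on the particular VLM and is left unspecified here.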
