Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

Théo Gigant, Camille Guinaudeau, Frédéric Dufaux

TL;DR

This work tackles the problem of summarizing long, text-heavy multimodal presentations using Vision-Language Models, examining how input representation and structure affect cost and performance. It builds an 822-presentation benchmark from the TIB dataset and evaluates open-weight models (notably Qwen2-VL) across unimodal and multimodal inputs, with a focus on interleaved slides-transcript representations and visual token budgets. Key contributions include a comprehensive dataset and benchmark, a fine-grained analysis of how input representations affect extractive statistics and relevance, and practical guidance showing that structured interleaved inputs extend the input-length Pareto frontier beyond slides alone. The study also discusses cross-modal interactions and the limitations of current VLMs in handling conflicts between modalities, and suggests directions for richer training data to improve robustness and trust in multimodal summarization systems.

Abstract

Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be used as input to better effect than the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the ability of VLMs to understand documents of this nature.

Paper Structure

This paper contains 22 sections, 11 figures, 6 tables.

Figures (11)

  • Figure 1: VLMs are able to process a multimodal presentation in various unimodal and multimodal representations.
  • Figure 2: Average token count in speech transcript and OCR, and their overlap
  • Figure 3: Unimodal performance at different token budgets
  • Figure 4: The addition of structure improves the ROUGE score compared to the transcript alone
  • Figure 5: ROUGE score with different visual token budgets
  • ...and 6 more figures