Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach

Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, Apoorv Saxena

TL;DR

This work addresses the challenge of generating narrative, multimodal presentation slides from long documents by introducing DocPres, a multi-stage end-to-end framework that combines LLMs and vision-language models. The approach builds a hierarchical, bird's-eye view of the document, generates a coherent slide outline, maps slides to document sections for grounding, and then creates slide text and image selections in a staged manner to maintain flow and reduce hallucinations. Evaluations on the SciDuet dataset show DocPres outperforms several single-shot baselines on automated metrics like Coverage and Perplexity, with human evaluators rating DocPres higher on readability, consistency, and overall usability. The results demonstrate the practical value of decomposing a complex, long-context task into well-defined subtasks, enabling more reliable narrative slide generation without requiring task-specific training data.
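The staged flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the `Slide` class, every function name, and all prompt strings are hypothetical, and conditioning each slide on the previous slide's text is an assumed way to "maintain flow". Image selection with a vision-language model is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Slide:
    title: str
    sections: List[str] = field(default_factory=list)
    text: str = ""


def summarize_sections(doc: Dict[str, str], llm: LLM) -> Dict[str, str]:
    """Stage 1: build a bird's-eye view, one short summary per section."""
    return {name: llm(f"Summarize: {body}") for name, body in doc.items()}


def generate_outline(overview: Dict[str, str], llm: LLM) -> List[str]:
    """Stage 2: derive an ordered list of slide titles from the overview."""
    joined = "\n".join(f"{name}: {summary}" for name, summary in overview.items())
    raw = llm(f"Outline slide titles, one per line:\n{joined}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def map_to_sections(title: str, overview: Dict[str, str], llm: LLM) -> List[str]:
    """Stage 3: pick grounding sections; names not in the document are dropped."""
    raw = llm(f"Sections for '{title}' from: {', '.join(overview)}")
    return [s.strip() for s in raw.split(",") if s.strip() in overview]


def build_deck(doc: Dict[str, str], llm: LLM) -> List[Slide]:
    """Stage 4: write each slide from its mapped sections only."""
    overview = summarize_sections(doc, llm)
    slides: List[Slide] = []
    for title in generate_outline(overview, llm):
        sections = map_to_sections(title, overview, llm)
        context = "\n".join(doc[s] for s in sections)
        # Assumption: passing the previous slide's text helps narrative flow.
        previous = slides[-1].text if slides else ""
        text = llm(
            f"Write bullets for '{title}' using only:\n{context}\n"
            f"Previous slide: {previous}"
        )
        slides.append(Slide(title, sections, text))
    return slides
```

Restricting each slide's prompt to its mapped sections is what keeps the context short and the output grounded in this sketch; a single-shot baseline would instead pass the whole document in one prompt.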

Abstract

Generating presentation slides from a long document with multimodal elements such as text and images is an important task. Done manually, it is time-consuming and requires domain expertise. Existing approaches for generating a rich presentation from a document are often semi-automatic, or they only put a flat summary into the slides, ignoring the importance of a good narrative. In this paper, we address this research gap by proposing a multi-staged end-to-end model that combines an LLM and a VLM. We show experimentally that, compared to applying LLMs directly with state-of-the-art prompting, our proposed multi-staged solution performs better on both automated metrics and human evaluation.

Paper Structure

This paper contains 22 sections, 1 equation, 1 figure, and 9 tables.

Figures (1)

  • Figure 1: Comparison of DocPres (in green) with a conventional way of generating a presentation directly using an LLM (in blue).