PresentAgent: Multimodal Agent for Presentation Video Generation

Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao

TL;DR

The paper introduces PresentAgent, a modular system that converts long-form documents into narrated presentation videos by segmenting content, planning slide layouts, generating spoken narration, and synchronizing audio with visuals. It also presents PresentEval, a two-path evaluation framework that combines objective quiz-based comprehension with subjective quality assessment by vision-language models (VLMs), along with the Doc2Present benchmark of 30 document–presentation pairs. Experimental results show PresentAgent variants approaching human-level performance in both factual understanding and viewer-perceived quality, which supports the viability of integrated multimodal generation for accessible, explainable content. The work advances document-to-presentation generation by coupling language, vision, and speech components within a controllable pipeline that can itself be evaluated.

Abstract

We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
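The pipeline stages named in the abstract (segmentation, narration, speech synthesis, and audio-visual alignment) can be illustrated with a minimal sketch. This is not the authors' implementation: `segment_document`, `narrate`, `estimate_duration`, and `build_timeline` are hypothetical names, the narration and text-to-speech steps are replaced by simple stand-ins, and the speaking rate of 2.5 words per second is an assumed placeholder. The sketch only shows one way alignment could work, by giving each slide a time window equal to the length of its narration audio.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    title: str
    text: str


@dataclass
class TimedSlide:
    title: str
    narration: str
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video


def segment_document(doc: str) -> List[Segment]:
    """Hypothetical segmenter: one segment per blank-line-separated block.

    The first non-empty line of a block is used as the slide title.
    """
    segments = []
    for block in doc.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if lines:
            segments.append(Segment(title=lines[0], text=" ".join(lines[1:])))
    return segments


def narrate(segment: Segment) -> str:
    """Stand-in for the LLM narration step (assumption: a real system calls a model)."""
    return f"This slide covers {segment.title}. {segment.text}".strip()


def estimate_duration(narration: str, words_per_second: float = 2.5) -> float:
    """Stand-in for TTS: estimate audio length in seconds from the word count."""
    return len(narration.split()) / words_per_second


def build_timeline(
    doc: str,
    narrate_fn: Callable[[Segment], str] = narrate,
    duration_fn: Callable[[str], float] = estimate_duration,
) -> List[TimedSlide]:
    """Assign each slide a window whose length equals its narration length."""
    timeline = []
    cursor = 0.0
    for seg in segment_document(doc):
        narration = narrate_fn(seg)
        length = duration_fn(narration)
        timeline.append(TimedSlide(seg.title, narration, cursor, cursor + length))
        cursor += length
    return timeline


if __name__ == "__main__":
    demo = "Introduction\nWhy videos help.\n\nMethod\nSegment, narrate, compose."
    for slide in build_timeline(demo):
        print(f"{slide.start:6.2f}s - {slide.end:6.2f}s  {slide.title}")
```

Passing `narrate_fn` and `duration_fn` as parameters keeps the timeline logic independent of whichever language model or speech engine is plugged in, which mirrors the modular design the abstract describes.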

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of PresentAgent. It takes documents (e.g., web pages) as input and follows a generation pipeline: (1) document processing, (2) structured slide generation, (3) synchronized caption creation, and (4) audio synthesis. The final output is a presentation video combining visual slides with aligned narration. The intermediate results highlighted in purple mark the system's key transitional outputs during generation.
  • Figure 2: Document Diversity in Our Evaluation Benchmark.
  • Figure 3: Overview of our framework. Our approach addresses the full pipeline of document-to-presentation video generation and evaluation. Left: Given diverse input documents—including papers, websites, blogs, slides, and PDFs—PresentAgent generates narrated presentation videos by producing synchronized slide decks with audio. Right: To evaluate these videos, we introduce PresentEval, a two-part evaluation framework: (1) Objective Quiz Evaluation (top), which measures factual comprehension using Qwen-VL; and (2) Subjective Scoring (bottom), which uses vision-language models to rate content quality, visual design, and audio comprehension across predefined dimensions.
  • Figure 4: Overview of the PresentAgent framework. Our system takes diverse documents (e.g., papers, websites, PDFs) as input and follows a modular generation pipeline. It first performs outline generation (Step 1) and retrieves the most suitable template (Step 2), then generates slides and narration notes via a vision-language model (Step 3). The notes are converted into audio via TTS and composed into a presentation video (Step 4). To evaluate video quality, we design multiple prompts (Step 5) and feed them into a VLM-based scoring pipeline (Step 6) that outputs dimension-specific metrics.
  • Figure 5: PresentAgent Demo. Automatically generates academic-style slides and narrated videos from research papers, streamlining the transformation from written content to engaging visual presentations.
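
The two-part evaluation described in the Figure 3 caption can be sketched in a few lines. This is an illustrative aggregation only, not the PresentEval code: `quiz_accuracy` and `subjective_summary` are hypothetical helper names, the multiple-choice answer format is assumed, and the unweighted "overall" average is an assumption rather than something the source specifies. The model calls that would produce the answers and ratings are omitted, so the functions take them as plain inputs.

```python
from statistics import mean
from typing import Dict, List


def quiz_accuracy(predicted: List[str], answer_key: List[str]) -> float:
    """Objective path: fraction of quiz questions answered correctly.

    Answers are compared case-insensitively after trimming whitespace.
    """
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    if not answer_key:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predicted, answer_key)
    )
    return correct / len(answer_key)


def subjective_summary(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Subjective path: mean rating per dimension plus an unweighted overall mean.

    Dimensions with no ratings are skipped.
    """
    per_dim = {dim: mean(vals) for dim, vals in scores.items() if vals}
    overall = mean(list(per_dim.values())) if per_dim else 0.0
    summary = dict(per_dim)
    summary["overall"] = overall
    return summary


if __name__ == "__main__":
    # Hypothetical model outputs for one video.
    print(quiz_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
    print(subjective_summary({
        "content_quality": [4, 5],
        "visual_design": [3, 3],
        "audio_comprehension": [4, 4],
    }))
```

Keeping the objective and subjective paths as separate functions reflects the caption's split between factual comprehension and perceived quality, so the two numbers are reported side by side rather than merged into one score.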