SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation

Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, Chenyu You

TL;DR

This work tackles automated generation of scientific presentation slides from research papers by introducing SlideGen, a modular, six-agent pipeline that grounds content in visual layouts drawn from a rich template library and refines slides with a visual-in-the-loop step. It advances evaluation through Geometry-Aware Density (GAD) and SlideQA, linking layout aesthetics with content faithfulness. Empirical results show SlideGen outperforming multiple baselines and aligning with human judgments, while analysis shows that prompt design and backbone choice significantly affect content richness and efficiency. The approach promises scalable, design-aware scientific communication and is designed to generalize to other domains via extensible layouts and templates.

Abstract

Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long-context understanding and deliberate visual planning. Existing approaches largely reduce it to text-only summarization, overlooking the visual component and design-intensive nature of slide creation. In this paper, we introduce SlideGen, an agentic, modular, and visual-in-the-loop framework for scientific paper-to-slide generation. SlideGen orchestrates a group of vision-language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert-level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design-aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.

Paper Structure

This paper contains 31 sections, 19 equations, 59 figures, 9 tables, 1 algorithm.

Figures (59)

  • Figure 1: Overview of SlideGen pipeline. The multi-agent framework comprises six specialized agents that sequentially process a scientific paper via content planning, figure selection, layout design, equation integration, visual refinement, and narration generation.
  • Figure 2: Overview of the template library and representative slide outputs. The left and right panels follow the same structure: the left side of each panel shows a subset of the slide template library used by Arranger; the right side shows two representative slides generated with those templates. Four slides are shown in total, produced with templates T3, T4, T14, and T16. Each template addresses a typical presentation structure (e.g., text-only, image-left, two-column). Throughout the paper, we adopt 16:9 as the default deck aspect ratio, while users are free to modify the template library’s size and aspect ratio. The complete collection is provided in the Appendix (Template Library section).
  • Figure 3: Comparison of generated slides with block abstractions. Each slide is shown as colored blocks, revealing that prior methods largely converge to similar vertical layouts, while SlideGen produces more varied and visually structured designs.
  • Figure 4: Color adjustment method on two fixed-hue planes for Refiner. Examples on the left and right illustrate failure cases and the final readable and high-contrast choice.
  • Figure 5: Example slides generated via SlideGen using GPT-4o (a) and GPT-5 (b) with the default deep blue theme. Additional samples are shown in the Appendix (SlideGen samples section). We use structured prompt templates for all agent calls, and the full prompts for all agents are provided in the Appendix (prompts section).
  • ...and 54 more figures
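
The Figure 1 caption describes six agents that process a paper sequentially. The control flow this implies can be sketched as below. This is a minimal illustration, not the authors' implementation: `DeckState`, the stage function names, and all stage bodies are hypothetical placeholders, and a real system would call a vision-language model at each stage. Only the stage order (taken from the Figure 1 caption) and the template names "text-only" and "image-left" (taken from the Figure 2 caption) come from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DeckState:
    """Shared state handed from one stage to the next (hypothetical structure)."""
    paper_text: str
    outline: List[str] = field(default_factory=list)
    figures: Dict[str, str] = field(default_factory=dict)
    layouts: Dict[str, str] = field(default_factory=dict)
    equations: Dict[str, str] = field(default_factory=dict)
    notes: Dict[str, str] = field(default_factory=dict)
    refined: bool = False
    trace: List[str] = field(default_factory=list)


def plan_content(state: DeckState) -> DeckState:
    # Placeholder: treat every non-empty line of the input as one slide title.
    state.outline = [ln.strip() for ln in state.paper_text.splitlines() if ln.strip()]
    return state


def select_figures(state: DeckState) -> DeckState:
    # Placeholder heuristic: attach a figure only to slides mentioning "result".
    state.figures = {
        title: f"fig_{i}.png"
        for i, title in enumerate(state.outline)
        if "result" in title.lower()
    }
    return state


def design_layout(state: DeckState) -> DeckState:
    # Pick a template per slide; the names follow the Figure 2 caption.
    state.layouts = {
        title: "image-left" if title in state.figures else "text-only"
        for title in state.outline
    }
    return state


def integrate_equations(state: DeckState) -> DeckState:
    # Placeholder: no equations are extracted in this sketch.
    state.equations = {}
    return state


def refine_visuals(state: DeckState) -> DeckState:
    # Placeholder for the visual-in-the-loop step (render, inspect, adjust).
    state.refined = True
    return state


def generate_narration(state: DeckState) -> DeckState:
    # Placeholder speaker notes, one entry per slide.
    state.notes = {title: f"Speaker notes for {title}" for title in state.outline}
    return state


# Stage order as listed in the Figure 1 caption.
STAGES: List[Tuple[str, Callable[[DeckState], DeckState]]] = [
    ("content planning", plan_content),
    ("figure selection", select_figures),
    ("layout design", design_layout),
    ("equation integration", integrate_equations),
    ("visual refinement", refine_visuals),
    ("narration generation", generate_narration),
]


def run_pipeline(paper_text: str) -> DeckState:
    """Run every stage in order, recording which stages have executed."""
    state = DeckState(paper_text=paper_text)
    for name, stage in STAGES:
        state = stage(state)
        state.trace.append(name)
    return state
```

Calling `run_pipeline("Introduction\nMethod\nResults")` returns a `DeckState` whose `trace` lists the six stage names in order. Passing one shared state object through a list of stages keeps each stage independently replaceable, which matches the modular framing in the abstract.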