AutoPresent: Designing Structured Visuals from Scratch

Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell

TL;DR

This work defines the NL-to-slide generation task and introduces SlidesBench, a large benchmark enabling reference-based and reference-free evaluation of slide design. It advocates programmatic slide construction via NL-to-code generation and introduces SlidesLib to simplify program generation, culminating in AutoPresent, an open-source 8B model that approaches GPT-4o in performance. Through extensive experiments and an iterative refinement pipeline, the authors show that code-generation-based approaches with modular tooling yield higher-quality, editable slides than end-to-end image-generation methods. The work establishes a foundation for automated, structured visuals and highlights avenues for future deck-level generation and enhanced design principles.

Abstract

Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce SlidesBench, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, to measure similarity to a target slide, and (ii) reference-free, to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, which achieves results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked with refining its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.
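
To make the distinction between the two evaluation modes concrete, here is a minimal sketch in Python. It is a toy illustration only and not the SlidesBench metric suite: the `TextBox` type, both scoring functions, and the specific checks (word overlap, boxes staying inside the slide) are assumptions introduced for this example. The point is only the shape of each call: a reference-based score takes a generated slide and a target slide, while a reference-free score takes the generated slide alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical slide element: text plus a bounding box in slide fractions (0..1)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def reference_based_score(generated, reference):
    """Toy similarity to a target slide: Jaccard overlap of lowercase words."""
    gen_words = {w for box in generated for w in box.text.lower().split()}
    ref_words = {w for box in reference for w in box.text.lower().split()}
    if not gen_words and not ref_words:
        return 1.0  # two empty slides are trivially identical
    return len(gen_words & ref_words) / len(gen_words | ref_words)


def reference_free_score(generated):
    """Toy design check with no target: fraction of boxes fully inside the slide."""
    if not generated:
        return 0.0
    inside = sum(
        1
        for b in generated
        if b.left >= 0
        and b.top >= 0
        and b.left + b.width <= 1
        and b.top + b.height <= 1
    )
    return inside / len(generated)


# Illustrative data (made up for this sketch, not taken from the benchmark).
reference = [TextBox("Quarterly Results", 0.1, 0.1, 0.8, 0.2)]
generated = [
    TextBox("Quarterly Results", 0.1, 0.1, 0.8, 0.2),
    TextBox("Revenue grew", 0.5, 0.5, 0.7, 0.2),  # overflows the right edge
]

similarity = reference_based_score(generated, reference)  # needs the target slide
design = reference_free_score(generated)  # needs only the generated slide
```

In this sketch the generated slide shares two of four distinct words with the reference, and one of its two boxes overflows the slide, so a slide can score moderately on one mode independently of the other.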

Paper Structure (43 sections, 13 figures, 13 tables)

This paper contains 43 sections, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Automatically generating slides from natural language instructions. We propose AutoPresent, a tool-augmented code generation method that follows natural language instructions to design slides from scratch, as shown in the examples. This allows for precise control over all elements, including textual content, images, visual layouts, coloring, and more.
  • Figure 2: Illustration of SlidesBench. Each example of SlidesBench consists of three instructions: Detailed Instructions with Images, Detailed Instructions Only, and High-Level Instructions. The model is tasked to generate a slide based on the instruction, and the generated slide is evaluated on the metrics suite, which contains both the reference-free metrics and the reference-based metrics.
  • Figure 3: Examples of slides generated by different methods in three scenarios. End-to-end image generation methods fail to generate structured and clear slides. Small open-source models like Llama and LLaVA can barely generate any usable slides, while AutoPresent produces quality slides. Adding SlidesLib improves GPT-4o's performance on the detailed-instructions-only and high-level-instructions tasks.
  • Figure 4: Perceptual evaluation results in the detailed-instructions settings (1) with images and (2) without images. We ask users to score the quality of each slide from 1 to 5 and report the average score for each model. Users preferred slides from GPT-4o and AutoPresent over those from Llama, though both still fall short of human-designed slides.
  • Figure 5: Auto-refinement results with GPT-4o, where the model further addresses some previously neglected instructions (marked in green), such as shape, background color, and text.
  • ...and 8 more figures