DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

Daesik Jang; Morgan Lindsay Heisler; Linzi Xing; Yifei Li; Edward Wang; Ying Xiong; Yong Zhang; Zhenan Fan

TL;DR

DECKBench is a standardized benchmark for evaluating end-to-end academic slide generation and multi-turn editing. It provides a curated dataset of 294 paper–slide pairs and a simulated-user framework, and it assesses content fidelity, layout quality, and editability with slide-level and deck-level metrics, including across multi-turn revisions. A modular multi-agent baseline (Outline, Code, and Editor agents) demonstrates end-to-end capabilities and reveals failure modes, pointing to improvements in parsing, content selection, layout, and iterative refinement. The framework enables reproducible comparisons among generation and editing agents, with public code and data to support research on multimodal, interactive presentation systems. Overall, DECKBench unifies generation and editing evaluation under realistic, repeatable workflows with multi-level metrics.

Abstract

Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper-to-slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench.

Paper Structure (41 sections, 1 equation, 5 figures, 5 tables)

Introduction
Background and Related Work
Design of the Benchmark Dataset
Scope and Use Cases
Data Collection and Curation
Simulated User Editing
Evaluation Protocol and Metrics
Slide-Level Metrics
Slide-Level Evaluation Overview
Reference-Based Metrics
Reference-Free Metrics
Deck-Level Metrics
Multi-Turn Evaluation (Slide Editing / Interactive Agents)
Multi-Agent Baseline System Architecture
Experimental Setup
...and 26 more sections

Figures (5)

Figure 1: Overview of the multi-turn slide editing evaluation pipeline. A user simulation agent interacts with the editing agent to iteratively refine the HTML slide deck, with each iteration evaluated against ground-truth decks using metrics from the Multi-Turn Evaluation section. The complete slide generation pipeline (Fig. 2) is omitted for brevity.
Figure 2: Overview of the slide generation pipeline (described in the Multi-Agent Baseline System Architecture section). User instructions are processed by an outline agent that extracts key paper content and generates a slide outline, which is then converted by a code agent into compilable HTML slides.
Figure 3: (a) Baseline-relative $\Delta$DTW for different models and personas. (b) Baseline-relative $\Delta$ Transition Similarity for different models and personas.
Figure 4: Example of a slide before (left) and after (right) applying the simulated-user prompt.
Figure 5: Example of a slide added after applying the simulated-user prompt.