PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu

TL;DR

PPTBench addresses a gap in multimodal evaluation by providing a holistic benchmark for PowerPoint layout and design understanding. It unifies detection, understanding, modification, and generation tasks with a dual-input representation (structured JSON plus slide screenshots) derived from 958 PPTs, enabling rigorous assessment of visual-structural reasoning and API-guided manipulation. Across a diverse set of closed- and open-source LLMs, PPTBench reveals a persistent gap between semantic comprehension and layout-aware manipulation, with larger models and template-guided generation offering the strongest gains. The work also demonstrates the value of chain-of-thought prompts for generation and validates the reliability of LLM-based evaluation against human judgments, highlighting directions for future work in visual-structural reasoning and real-world slide automation.

Abstract

PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks and overlook the layout-centric challenges that are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Drawing on a diverse collection of 958 PPTX files, PPTBench evaluates models on 4,439 samples across four categories: Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

Paper Structure

This paper contains 57 sections, 4 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Task examples of the four categories in PPTBench on PowerPoint presentations.
  • Figure 2: Distribution of task categories in PPTBench, showing the number of samples under each sub-task.
  • Figure 3: Ablation study on selected tasks. Note that Image-only is not applicable for Modification, and Generation needs no input source.
  • Figure 4: Effect of template-guided prompting on Generation.
  • Figure 5: Case study of layout failure cases across four categories.
  • ...and 6 more figures