PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

Yiduo Guo; Zekai Zhang; Yaobo Liang; Dongyan Zhao; Nan Duan

TL;DR

The paper introduces the PPTC benchmark to evaluate large language models on complex PowerPoint task completion in multi-turn, multi-modal settings, and presents the PPTX-Match evaluation system for outcome-based assessment. It benchmarks 3 closed-source and 6 open-source LLMs and finds that GPT-4 is the strongest but still struggles with long sessions, template complexity, and spatial reasoning. The study identifies three core error sources (error accumulation, long templates, and multimodal perception) and shows limited gains from planning and content/API selection strategies in session-based tasks. It also analyzes the impact of model size and dialogue history, and provides data, code, and tools to support future research in AI-assisted office tasks.

Abstract

Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs that use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System, which evaluates whether an LLM has completed the instruction based on the prediction file rather than the label API sequence, so it supports various LLM-generated API sequences. We measure 3 closed-source LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
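
The gap between single-turn and session accuracy reported above can be illustrated with a toy calculation. The sketch below is not the benchmark's evaluation code: the function names and the per-turn outcomes are hypothetical, and it assumes that turn accuracy is the fraction of correct turns and that a session counts as correct only if every turn in it is correct. It also ignores that the two settings feed the model different dialogue histories and file contents.

```python
from typing import List


def turn_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of individual turns completed correctly (assumed definition)."""
    turns = [ok for session in sessions for ok in session]
    return sum(turns) / len(turns)


def session_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of sessions in which every turn is correct (assumed definition)."""
    return sum(all(session) for session in sessions) / len(sessions)


# Hypothetical per-turn outcomes for three sessions; not benchmark data.
sessions = [
    [True, True, False, True],
    [True, True, True],
    [True, False, True, True, True],
]

print(f"turn accuracy:    {turn_accuracy(sessions):.3f}")
print(f"session accuracy: {session_accuracy(sessions):.3f}")
```

Under these assumptions a single failed turn invalidates its whole session, so session accuracy falls much faster than turn accuracy as sessions get longer.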

Paper Structure (37 sections, 7 figures, 4 tables)

Introduction
PPTC Benchmark
Benchmark Overview
Benchmark Collection
PPTX-Match Evaluation System
Benchmark Statistics Analysis
Algorithms
Planning Algorithms
Selection Algorithms
Experiments
Large Language Models Selected for Evaluation
Experimental Setup
Turn-Based and Session-Based Evaluations
Implementation Details
Main results
...and 22 more sections

Figures (7)

Figure 2: We illustrate how LLMs complete one turn in a session. (A) To prompt the LLM, we provide it with the current instruction, previous instructions (dialogue history), PPT file content, and the API reference file. 'PPT reader' is a function that transforms the PPT file into the text-based format as the PPT file content. (B) The LLM then generates the API sequence and executes it to obtain the prediction PPT file. (C) We evaluate attributes and position relations in the prediction file.
Figure 3: Statistics for PPTC. a) Session turn number distribution. b) Instruction API number distribution (tokens). c) Distribution of instructions involving Chart, Table, Picture, and Position. Instructions involving 'Position' need the system to conduct position-related operations based on the understanding of spatial information. Note that one instruction may involve multiple different modalities.
Figure 4: The inference prompt we used in both turn-based and session-based evaluation settings. In the turn-based evaluation, we assess the LLM's performance for the current turn and assume the LLM has correctly finished previous turns. We then use feasible API sequences of previous turns as the AI response in the dialogue history and parse the label file of previous turns as the PPT file content. In the session-based evaluation, we evaluate the completion of the entire session and do not assume the LLM has correctly finished previous turns. We use the LLM's generated API sequences as the response and parse the LLM prediction file as the PPT file content.
Figure 5: We illustrate the analysis results of the creating new PPT file task (task 1) and the editing PPT template task (task 2). In sub-figure (a), we report the average turn-based accuracy for instructions involving chart, table, picture, position, and pure text. We do not plot chart accuracy for task 2 because this task contains no chart instructions. In sub-figure (c), we report accuracy as a function of model size. In sub-figure (b), we report the ratio of four common errors made by GPT-4. We do not plot the session-based accuracy of task 2 as it is zero.
Figure 6: The reference API file: part 1.
...and 2 more figures
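
The outcome-based check described in the Figure 2 caption (comparing attributes and position relations in the prediction file rather than comparing API sequences) could be sketched roughly as follows. This is a minimal illustration, not the PPTX-Match implementation: the `Shape` class, the relation names `left_of` and `above`, and both helper functions are hypothetical, and a real system would read these values from parsed PPT files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Shape:
    """Hypothetical simplified view of one shape on a slide."""
    name: str
    left: float
    top: float
    attrs: Dict[str, str] = field(default_factory=dict)


def attributes_match(pred: List[Shape], label: List[Shape]) -> bool:
    """True if every labelled shape exists in the prediction with equal attributes."""
    pred_by_name = {s.name: s for s in pred}
    for want in label:
        got = pred_by_name.get(want.name)
        if got is None or got.attrs != want.attrs:
            return False
    return True


def relations_hold(pred: List[Shape], relations: List[Tuple[str, str, str]]) -> bool:
    """True if every (a, relation, b) triple holds between predicted shapes."""
    by_name = {s.name: s for s in pred}
    for a, rel, b in relations:
        if a not in by_name or b not in by_name:
            return False
        sa, sb = by_name[a], by_name[b]
        if rel == "left_of":
            ok = sa.left < sb.left
        elif rel == "above":
            ok = sa.top < sb.top
        else:
            raise ValueError(f"unknown relation: {rel}")
        if not ok:
            return False
    return True


# Hypothetical label and prediction: coordinates differ, the outcome does not.
label = [Shape("title", 1.0, 1.0, {"text": "Q3 Report"}),
         Shape("logo", 5.0, 1.0, {"kind": "picture"})]
pred = [Shape("title", 0.5, 0.8, {"text": "Q3 Report"}),
        Shape("logo", 6.0, 0.8, {"kind": "picture"})]
relations = [("title", "left_of", "logo")]

passed = attributes_match(pred, label) and relations_hold(pred, relations)
print("instruction completed:", passed)
```

Checking relations instead of exact coordinates lets two different API sequences that produce an equivalent slide both count as correct, which is the property the caption attributes to evaluating the prediction file.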
