SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation

Zeyao Ma; Bohan Zhang; Jing Zhang; Jifan Yu; Xiaokang Zhang; Xiaohan Zhang; Sijia Luo; Xi Wang; Jie Tang

TL;DR

A comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.

Abstract

We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.
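The online-judge-style evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `judge` function, the dictionary representation of a sheet, and the two example solutions are all hypothetical. The "soft" score (fraction of test cases passed) borrows the term from the figure captions, and the "hard" all-or-nothing score is an assumption made here for contrast.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a sheet is a mapping from cell address to value.
Sheet = Dict[str, object]
TestCase = Tuple[Sheet, Sheet]  # (input sheet, expected output sheet)


def judge(solution: Callable[[Sheet], Sheet], test_cases: List[TestCase]) -> Tuple[float, float]:
    """Run one candidate solution against every test case of one instruction.

    Returns (soft, hard), where soft is the fraction of test cases passed
    and hard is 1.0 only if every test case passes (assumed definitions).
    """
    passed = [solution(inp) == expected for inp, expected in test_cases]
    soft = sum(passed) / len(passed)
    hard = 1.0 if all(passed) else 0.0
    return soft, hard


def general(sheet: Sheet) -> Sheet:
    """Computes A3 from the actual inputs, so it generalizes across files."""
    out = dict(sheet)
    out["A3"] = sheet["A1"] + sheet["A2"]
    return out


def hard_coded(sheet: Sheet) -> Sheet:
    """Writes a constant that happens to be right for only one file."""
    out = dict(sheet)
    out["A3"] = 3
    return out


# Three spreadsheet files with different values for the same instruction:
# "put the sum of A1 and A2 into A3".
TEST_CASES: List[TestCase] = [
    ({"A1": 1, "A2": 2}, {"A1": 1, "A2": 2, "A3": 3}),
    ({"A1": 5, "A2": 7}, {"A1": 5, "A2": 7, "A3": 12}),
    ({"A1": 0, "A2": 4}, {"A1": 0, "A2": 4, "A3": 4}),
]

if __name__ == "__main__":
    print(judge(general, TEST_CASES))
    print(judge(hard_coded, TEST_CASES))
```

With a single test file, the hard-coded solution would be indistinguishable from the general one. Varying the values across several files is what exposes it.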

Paper Structure (24 sections, 2 equations, 27 figures, 5 tables)

Introduction
SpreadsheetBench
Task Formulation
Benchmark Construction
Benchmark Statistics
Evaluation Metrics
Experiments
Conclusion
Broader Discussion
Limitation
Potential Impact
Ethical Consideration
Maintenance Plan
Details of Data Collection
Data Source
...and 9 more sections

Figures (27)

Figure 1: Comparison of the earlier SheetCopilotBench and our benchmark. The instructions in our benchmark are more complex, reflecting real user demands, issues encountered, previous user attempts, and an output example. The spreadsheets in our benchmark organize data more flexibly. The figure summarizes the data organization types used in our spreadsheets: three kinds of non-standard relational tables (with nested headers, incomplete headers, or missing headers), in addition to cells containing pure textual information and non-textual elements such as colors.
Figure 2: The benchmark construction pipeline and OJ-style evaluation of our benchmark.
Figure 3: Key statistics of SpreadsheetBench.
Figure 3: Overall soft restriction of GPT-4o on different subsets (%).
Figure 4: Impact of input row size and round number in terms of overall soft restriction.
...and 22 more figures
