SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

Srivatsa Kundurthy; Clara Na; Michael Handley; Zach Kirshner; Chen Bo Calvin Zhang; Manasi Sharma; Emma Strubell; John Ling

TL;DR

This work considers the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language, and introduces SpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks.

Abstract

Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users' explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models' performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggest that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. We hope that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at https://spreadsheetarena.ai.

Paper Structure (55 sections, 4 equations, 13 figures, 18 tables, 1 algorithm)

Introduction
Related Work
Human Preferences.
Structured Artifact Generation.
Background
The Bradley-Terry Model
Elo-like Ratings from Strength Coefficients
Feature-augmented Bradley-Terry Models
SpreadsheetArena
Task Formulation
Our Approach
Arena Methodology
Results and Analysis
General Results
Spreadsheet Preferences vs. Code and Chat Settings
...and 40 more sections

Figures (13)

Figure 1: Elo ratings for 16 models ranked in SpreadsheetArena. Standard Elo scores are anchored on GPT-4o at 1000. Overall, Claude models are often preferred. In the Results and Analysis section, we contextualize these global rankings with observable feature-adjusted scores, category-specific analysis across prompts, characterization of failure modes in dispreferred spreadsheets, and expert evaluations in financial modeling use cases.
Figure 2: In SpreadsheetArena, users submit a prompt and are shown four pairwise battles between LLM-generated spreadsheet workbooks. Votes are blind, and users can indicate that one spreadsheet is preferred over the other, or that both are equally satisfactory or unsatisfactory. Workbooks can contain multiple sheets, and sheets often contain a mixture of text, values, and formulas, where cells may contain stylistic formatting (e.g., bold text or a fill color).
Figure 3: Pairwise win probability change ($\Delta P_{\text{win}}$) after adjusting for 29 spreadsheet features in the Bradley-Terry model.
Figure 4: Distribution of expert ratings across six evaluation dimensions for finance-domain spreadsheets ($n=134$ evaluations). Color Coding and Formatting stands out as the weakest dimension, with 77.6% of evaluations scoring 2 or below.
Figure 5: Elo ratings trend inwards after feature adjustment.
...and 8 more figures
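
The captions for Figures 1 and 3 refer to Elo ratings anchored at 1000 and to pairwise win probabilities under a Bradley-Terry model. As a rough illustration of how these quantities relate, the sketch below fits Bradley-Terry strengths to a toy win table and rescales them to an anchored Elo-like scale. The model names, the win counts, the iterative fitting routine, and the 400-point base-10 scale are all assumptions made for illustration; they are not taken from the paper, which may use a different estimator or scale.

```python
import math

def fit_bradley_terry(wins, n_iter=500):
    """Fit Bradley-Terry strengths with a simple minorize-maximize loop.

    wins[i][j] is the number of times model i was preferred over model j.
    Ties are ignored in this sketch.
    """
    models = list(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(wins[i].get(j, 0) for j in models if j != i)
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom
        # Strengths are only identified up to a constant factor, so rescale
        # them to a fixed sum after every pass.
        norm = sum(updated.values())
        strength = {m: s * len(models) / norm for m, s in updated.items()}
    return strength

def to_elo(strength, anchor, anchor_rating=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale where `anchor` sits at `anchor_rating`."""
    return {
        m: anchor_rating + scale * math.log10(s / strength[anchor])
        for m, s in strength.items()
    }

def win_probability(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B, given Elo-like ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Hypothetical toy data: each pair of models met in 10 battles.
toy_wins = {
    "anchor":  {"model_a": 3, "model_b": 6},
    "model_a": {"anchor": 7, "model_b": 6},
    "model_b": {"anchor": 4, "model_a": 4},
}

strength = fit_bradley_terry(toy_wins)
elo = to_elo(strength, anchor="anchor")
for name, rating in sorted(elo.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Under this convention, the win probability computed from two Elo-like ratings equals the Bradley-Terry ratio of the corresponding strengths, so anchoring only shifts the scale and leaves pairwise predictions unchanged.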
