SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

Haoyu Dong; Jianbo Zhao; Yuzhang Tian; Junyu Xiong; Shiyu Xia; Mengyu Zhou; Yun Lin; José Cambronero; Yeye He; Shi Han; Dongmei Zhang

TL;DR

SpreadsheetLLM introduces SheetCompressor, which lets large language models process spreadsheets efficiently by combining structural-anchor-based extraction, inverted-index translation, and data-format-aware aggregation. The framework reduces token counts substantially (about 25x on average) and achieves state-of-the-art performance on spreadsheet table detection, with GPT-4-based results surpassing prior methods by over 12 percentage points. For downstream tasks, Chain of Spreadsheet supports spreadsheet QA by decomposing each task into table identification, boundary detection, and reasoning, yielding notable gains over existing baselines. The approach generalizes across model families and spreadsheet sizes, lowers cost, and makes spreadsheet understanding and interaction more scalable. The authors identify richer format and semantic cues as avenues for future work.

Abstract

Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs). In response, we introduce SpreadsheetLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs' powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications. To tackle this challenge, we develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverted-index translation, and data-format-aware aggregation. It significantly improves performance in the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting. Moreover, an LLM fine-tuned with SheetCompressor achieves an average compression ratio of 25 times and a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream tasks of spreadsheet understanding and validate it in a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SpreadsheetLLM is highly effective across a variety of spreadsheet tasks.
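To make the encoding idea concrete, the following is a minimal sketch, not the authors' implementation, of a vanilla address-and-value serialization next to an inverted-index variant that groups addresses by shared value and drops empty cells. The function names `vanilla_encode` and `inverted_index_encode`, the delimiters, and the toy sheet are assumptions made for illustration; cell formats are omitted.

```python
from collections import defaultdict


def vanilla_encode(cells):
    """Serialize every cell, including empty ones, as 'address,value' pairs."""
    return "|".join(f"{addr},{value}" for addr, value in cells.items())


def inverted_index_encode(cells):
    """Map each distinct non-empty value to the addresses that hold it."""
    index = defaultdict(list)
    for addr, value in cells.items():
        if value != "":  # empty cells carry no content, so they are skipped
            index[value].append(addr)
    return dict(index)


# Hypothetical toy sheet: cell address -> displayed value.
sheet = {
    "A1": "Year", "B1": "Status",
    "A2": "2023", "B2": "OK",
    "A3": "2024", "B3": "OK",
    "A4": "", "B4": "",
}

print(vanilla_encode(sheet))
print(inverted_index_encode(sheet))
```

In this toy case the repeated value `OK` is stored once with two addresses, and the two empty cells disappear from the encoding. On real sheets with many repeated or empty cells, that is where the savings would be expected to come from.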

Paper Structure (65 sections, 9 equations, 10 figures, 12 tables, 2 algorithms)

Introduction
Related Work
Spreadsheet Representation
Spreadsheet Understanding
Spreadsheet Downstream Tasks
LLMs' Token Efficiency
Method
Vanilla Spreadsheet Encoding with Cell Value, Address, and Format
Structural-anchor-based Extraction
Inverted-index Translation
Data-format-aware Aggregation
Chain of Spreadsheet
Table Identification and Boundary Detection
Response Generation
Experiments
...and 50 more sections

Figures (10)

Figure 1: The SpreadsheetLLM pipeline.
Figure 2: Illustration of the SheetCompressor framework. The original spreadsheet contains two tables, featuring numerous data entries or hierarchical headers. The complete spreadsheet consists of 576 rows and 23 columns, with a vanilla encoding of 61,240 tokens. We first extract cells using structural anchors, rearranging them into a smaller 24×8 sheet. Subsequently, we perform inverted-index translation, removing empty cells. Finally, we aggregate cells based on data formats, achieving an extremely compact representation of the spreadsheet with only 708 tokens.
Figure 3: Challenges of GPT4 understanding spreadsheet data.
Figure 4: GPT4 encoding methods and techniques for processing spreadsheet data.
Figure 5: Examples of three traditional spreadsheet encoding methods: Markdown, XML, and HTML. Due to space limitations, we only show the encoding of some cells.
...and 5 more figures
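The numbers in the Figure 2 caption can be turned into a quick worked calculation. The sketch below uses only the values quoted in that caption; the variable names are illustrative. The resulting ratio describes this single example, whereas the 25x figure in the abstract is an average over the evaluated spreadsheets.

```python
# Values quoted in the Figure 2 caption.
rows, cols = 576, 23
anchor_rows, anchor_cols = 24, 8
vanilla_tokens = 61_240
compressed_tokens = 708

cells_before = rows * cols                # cells in the full grid
cells_after = anchor_rows * anchor_cols   # cells kept after anchor extraction
token_ratio = vanilla_tokens / compressed_tokens

print(cells_before)            # 13248
print(cells_after)             # 192
print(round(token_ratio, 1))   # 86.5
```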
