FinSheet-Bench: From Simple Lookups to Complex Reasoning, Where LLMs Break on Financial Spreadsheets

Jan Ravnik, Matjaž Ličen, Felix Bührmann, Bithiah Yuan, Felix Stinson, Tanvi Singh

TL;DR

FinSheet-Bench is a benchmark of synthetic financial portfolio data modeled on real private equity fund structures, designed to evaluate LLM performance on text-serialized spreadsheet question answering and numeric reasoning tasks. The evaluation reveals that no standalone model achieves error rates low enough for unsupervised use in professional finance applications.

Abstract

While Large Language Models (LLMs) can accelerate text-heavy tasks in alternative investment due diligence, a gap remains in their ability to accurately extract and reason over structured tabular data from complex financial spreadsheets. Progress is held back by the lack of real industry fund portfolio datasets for benchmarking, as private equity data rooms are confidential. To address this, we introduce FinSheet-Bench, a benchmark of synthetic financial portfolio data modeled on real private equity fund structures, designed to evaluate LLM performance on text-serialized spreadsheet question answering and numeric reasoning tasks. Our evaluation of ten model configurations from OpenAI, Google, and Anthropic on financial spreadsheets, including complex layouts, fund dividers, and multi-line column names, reveals that no standalone model achieves error rates low enough for unsupervised use in professional finance applications. The best-performing model, Gemini 3.1 Pro, achieves 82.4% accuracy across twenty-four evaluation files of varying complexity and structural layout (approximately 1 error per 6 questions), followed by GPT-5.2 with reasoning at 80.4%, Claude Opus 4.6 with thinking at 80.2%, and Gemini 3 Pro at 80.2%. Performance degrades substantially on larger, more complex spreadsheets: the largest spreadsheet (152 companies, 8 funds) yields an average accuracy of just 48.6% across all models, compared to 86.2% on the easiest evaluation file. These difficulty patterns are consistent across all ten models, indicating that they reflect LLM limitations rather than idiosyncratic model weaknesses. Reliable financial spreadsheet extraction will likely require architectural approaches that separate document understanding from deterministic computation.
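The "approximately 1 error per 6 questions" figure follows directly from the reported accuracy: the error rate is one minus the accuracy, and its reciprocal is the expected number of questions per error. A minimal sketch of that arithmetic, using only the accuracies quoted in the abstract (the function name `questions_per_error` is illustrative and not taken from the paper):

```python
def questions_per_error(accuracy_pct: float) -> float:
    """Expected number of questions asked per wrong answer.

    accuracy_pct is a percentage in [0, 100), e.g. 82.4.
    """
    error_rate = 1.0 - accuracy_pct / 100.0
    if error_rate <= 0.0:
        raise ValueError("accuracy of 100% or more implies no errors")
    return 1.0 / error_rate


# Accuracies as reported in the abstract.
reported = {
    "Gemini 3.1 Pro": 82.4,
    "GPT-5.2 with reasoning": 80.4,
    "Claude Opus 4.6 with thinking": 80.2,
    "Gemini 3 Pro": 80.2,
}

for model, accuracy in reported.items():
    ratio = questions_per_error(accuracy)
    print(f"{model}: about 1 error per {ratio:.1f} questions")
```

For the best model the ratio is a little under 6, which the abstract rounds to 1 error per 6 questions; the other three models land closer to 1 error per 5 questions.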

Paper Structure (56 sections, 9 figures, 8 tables)

Figures (9)

  • Figure 1: LLM accuracy on financial spreadsheet tasks plotted against model release date. Bubble size indicates average response time per question. At 82.4%, even the best model produces approximately 1 error per 6 questions. Newer models and models with reasoning generally achieve higher accuracy. See Table \ref{tab:overall_results} for detailed results.
  • Figure 2: LLM accuracy by question category (stacked bars). The dark base shows pooled accuracy across all ten models; the middle segment adds the gain when restricting to the top 3 models; the lightest segment adds the further gain of the single best-performing model per category (labeled above each bar). Simple lookups achieve high accuracy, but performance drops for tasks requiring calculation or multi-step reasoning. Sorting (37.5% all models, 83% best) is particularly challenging given that list extraction achieves 85.5%; models can identify what is in a list but struggle with ordering by magnitude. See text for a detailed discussion of tokenization artifacts vs. reasoning failures.
  • Figure 3: Accuracy vs. task complexity level (stacked bars). The dark base shows pooled accuracy across all ten models; the middle segment adds the gain when restricting to the top 3 models; the lightest segment adds the further gain of the single best-performing model per complexity level (labeled above each bar). Low-complexity tasks achieve $\sim$89--95% accuracy, but performance drops sharply at High and Very High complexity, where tasks require multi-step calculations, aggregations, or sorting. Even the best individual model drops to 47.8% on Very High complexity.
  • Figure 4: Model performance comparison by question type across all ten model configurations (OpenAI, Google, Anthropic).
  • Figure 5: Heatmap of accuracy by model and question type across all ten model configurations (4 OpenAI, 4 Google, 2 Anthropic). Green indicates high accuracy; red indicates failure. Results span twenty-four evaluation files (approximately 500 questions per model; some files contribute fewer questions due to missing columns; see Section 4.2 for per-file question counts).
  • ...and 4 more figures
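
The stacked bars in Figures 2 and 3 can be read as three cumulative accuracies split into increments: the base is the pooled all-model accuracy, the middle segment is the gain from restricting to the top 3 models, and the top segment is the further gain of the single best model. A minimal sketch of that decomposition follows. The function name `stacked_segments` is illustrative, and the top-3 value in the example is a hypothetical placeholder, since the captions quote only the all-model (37.5%) and best-model (83%) figures for sorting.

```python
def stacked_segments(all_models: float, top3: float, best: float) -> tuple[float, float, float]:
    """Split three cumulative accuracies (in percent) into stacked-bar segment heights.

    Returns (base, top-3 gain, best-model gain). Assumes the accuracies are
    ordered all_models <= top3 <= best, as the captions' "adds the gain" implies.
    """
    if not (0.0 <= all_models <= top3 <= best <= 100.0):
        raise ValueError("expected 0 <= all_models <= top3 <= best <= 100")
    return (all_models, top3 - all_models, best - top3)


# Sorting category: 37.5 and 83.0 come from the Figure 2 caption;
# 60.0 for the top-3 models is a HYPOTHETICAL value used only for illustration.
base, top3_gain, best_gain = stacked_segments(37.5, 60.0, 83.0)
print(base, top3_gain, best_gain)

# The total bar height always equals the best-model accuracy.
assert base + top3_gain + best_gain == 83.0
```

Read this way, the height of the full bar is the best single-model accuracy, and the dark base alone shows what a randomly chosen model configuration achieves on average.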