MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Lei Wang; Shan Dong; Yuhui Xu; Hanze Dong; Yalu Wang; Amrita Saha; Ee-Peng Lim; Caiming Xiong; Doyen Sahoo

TL;DR

MathHay targets long-context mathematical reasoning in LLMs by automating data collection, question generation, quality control, and haystack construction, with test contexts of up to 128K tokens. Eight top-performing LLMs are evaluated; even the best, Gemini-1.5-Pro-002, attains only 51.26% accuracy at 128K tokens, indicating substantial room for improvement. A hybrid evaluation using exact-match and a GPT-4o judge yields a strong correlation with human judgments ($\rho = 0.9183$) between verified and unverified data, supporting automated scalability. The study demonstrates the difficulty of multi-step, multi-document math reasoning and offers a scalable benchmark framework for advancing long-context mathematical reasoning in LLMs.

Abstract

Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks like Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.
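The hybrid evaluation mentioned in the TL;DR (exact-match combined with a GPT-4o judge) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the numeric tolerance, the order of the two checks, and the `judge` callable (a stand-in for a call to an LLM judge) are all assumptions made for this example.

```python
import math
from typing import Callable, Optional


def exact_match(predicted: float, reference: float, rel_tol: float = 1e-4) -> bool:
    """Numeric exact-match check; the tolerance is an assumed value."""
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-9)


def hybrid_is_correct(
    answer_text: str,
    predicted_value: Optional[float],
    reference: float,
    judge: Callable[[str, float], bool],
) -> bool:
    """Accept an answer if it matches exactly; otherwise defer to a judge.

    `judge` is a hypothetical callable standing in for an LLM-based judge.
    It receives the raw answer text and the reference value and returns
    True when it considers the answer correct.
    """
    # Cheap check first: a parsed numeric answer that matches the reference.
    if predicted_value is not None and exact_match(predicted_value, reference):
        return True
    # Fall back to the judge when parsing failed or the numbers differ.
    return judge(answer_text, reference)
```

In this sketch the judge is only consulted when the exact-match check fails, which keeps the number of judge calls low; whether the benchmark combines the two signals in this order is not stated in the text above.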

Paper Structure (39 sections, 18 figures, 3 tables)

Introduction
Related Work
  Long-Context Benchmarks
  Mathematical Reasoning Benchmarks
Benchmark Construction
  Document Collection
    Topic Generation
    Relevant Document Collection
    Document Filtering
  Question Generation
    Single-Step, Single-Document Mathematical Reasoning Task (SSSD)
    Multi-Step, Single-Document Mathematical Reasoning Task (MSSD)
    Single-Step, Multi-Document Mathematical Reasoning Task (SSMD)
    Multi-Step, Multi-Document Mathematical Reasoning Task (MSMD)
  Quality Control
...and 24 more sections

Figures (18)

Figure 1: Overview of the framework for the automatic construction of the MathHay Benchmark. The upper section illustrates the document collection process, while the lower section outlines the stages of question generation, quality control, and haystack construction.
Figure 2: Accuracy of GPT-4o-mini on (a) single-document; (b) two-document; (c) three-document mathematical reasoning tasks from a subset of the MathHay Benchmark, with varying relevant document placements and input lengths.
Figure 2: Key statistics of MathHay.
Figure 3: Topic and task distribution. FMA: Financial Market Analysis, HCA: Healthcare Cost Analysis, UP: Urban Planning, EIA: Environmental Impact Assessment, SCM: Supply Chain Management, SA: Sports Analytics, ECA: Energy Consumption Analysis, REMT: Real Estate Market Trends, EF: Education Funding, AE: Agricultural Economics.
Figure 4: Performance of GPT-4o and GPT-4o-mini on single-document tasks (SSSD, MSSD) with varying placement depths and input lengths. The y-axis represents the depth of the relevant document. For example, a depth of 10% indicates that the document is placed 10% of the way into the noisy input text.
...and 13 more figures
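
The placement depth described in the Figure 4 caption can be illustrated with a short sketch. It assumes the haystack is a list of noise documents and that depth is applied at document granularity; the benchmark itself may place documents at token level, and the function name `place_at_depth` is hypothetical.

```python
from typing import List


def place_at_depth(noise_docs: List[str], relevant_doc: str, depth: float) -> List[str]:
    """Insert `relevant_doc` into `noise_docs` at a fractional depth.

    depth=0.0 puts the relevant document first, depth=1.0 puts it last.
    Document-level granularity is an assumption made for this sketch.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the fractional depth into an insertion index.
    index = round(depth * len(noise_docs))
    return noise_docs[:index] + [relevant_doc] + noise_docs[index:]


# Usage: ten noise documents, relevant document placed at 10% depth.
noise = [f"noise-{i}" for i in range(10)]
haystack = place_at_depth(noise, "relevant", depth=0.1)
```

With ten noise documents and a depth of 0.1, the relevant document ends up after the first noise document, so one tenth of the noise precedes it.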
