StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou

TL;DR

StatEval introduces the first large-scale, statistics-focused benchmark designed to rigorously evaluate large language models on both foundational knowledge and research-level proof tasks. It combines a scalable multi-agent data extraction pipeline with a robust, process-based scoring framework that separately handles multiple-choice and open-ended QA problems, enabling fine-grained assessment of reasoning and derivations. The benchmark spans 13,817 foundational problems and 2,374 research-level tasks across more than 30 subfields, with data drawn from textbooks, exams, and top-tier journals, and is complemented by open data and code. Experimental results show that closed-source models, particularly the GPT-5 family, outperform their open-source counterparts. Performance on frontier statistical reasoning nevertheless remains significantly weaker than on foundational material, which underscores how difficult rigorous statistical inference is for current LLMs and positions StatEval as a benchmark for tracking progress in statistical intelligence.

Abstract

Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce StatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, while open-source models perform significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.

Paper Structure

This paper contains 30 sections, 13 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of StatEval, illustrating the Foundational Knowledge Dataset, Advanced Statistical Research Dataset, and example evaluations on tasks such as statistical hypothesis testing and asymptotic properties of estimators.
  • Figure 2: Disciplinary classification of foundational-level datasets.
  • Figure 3: Disciplinary classification of statistical research datasets.
  • Figure 4: Overview of the StatEval data processing pipeline. Each agent corresponds to a major functional stage in the automated extraction and verification process.
  • Figure 5: Examples and scoring procedures in StatEval.
  • ...and 2 more figures