LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Antoine Peyronnet; Fabian Gloeckle; Amaury Hayat

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

TL;DR

This work establishes an updatable benchmark evaluating models directly on the latest research results in mathematics, consisting of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit.

Abstract

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15$\%$ accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

TL;DR

Abstract

accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.

Paper Structure (26 sections, 1 theorem, 9 equations, 3 figures, 5 tables, 2 algorithms)

This paper contains 26 sections, 1 theorem, 9 equations, 3 figures, 5 tables, 2 algorithms.

Introduction
Problem Definition: Context Dependence in Extracted Theorems
Method
Data extraction
Full-context retrieval.
Vector Retrieval.
Self-Containedness Judge
Evaluation
Human evaluation
Cross-model evaluation
Results
Model Evaluation
Protocol for Proof Generation and Multi-Judge Evaluation
Generator.
Independent LLM-as-a-judge.
...and 11 more sections

Key Result

Lemma F.1

$\mathcal{L}_{T}: L^2_{\alpha-1}(0,T;U)\rightarrow L^{2}(\mathfrak{S},H;\mu)$ is a Hilbert--Schmidt operator.

Figures (3)

Figure 1: Pipeline at a glance: the pipeline retrieves papers from the arXiv, extract statements and make them self-contained. An external prover can then try to prove them step by step and each proof is inspected by a judge. The final score is the proportion of valid proofs. We used GPT-5, Gemini 2.5 pro, Gemini 3 pro, Claude 4.5 Opus, and Deepseek-R as a baseline for provers.
Figure 2: Percentage results relative to the total lemmas retrieved for both extraction modes (full context / vector retrieval) and models
Figure 3: DAG : input $\to$ parsing $\to$ normalization $\to$ extraction (double mode) $\to$ self contained $\to$ proof steps $\to$ judge proofs $\to$ dataset.

Theorems & Definitions (1)

Lemma F.1

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

TL;DR

Abstract

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (3)

Theorems & Definitions (1)