Table of Contents
Fetching ...

RLMEval: Evaluating Research-Level Neural Theorem Proving

Auguste Poiroux, Antoine Bosselut, Viktor Kunčak

TL;DR

RLMEval addresses a critical gap in evaluating automated reasoning for formal mathematics by focusing on research-level Lean mathematics and blueprint theorems from real-world projects. It defines two core tasks, Neural Theorem Proving and Proof Autoformalization, and provides Easy and Normal evaluation modes to simulate access to auxiliary lemmas, using a dataset of 613 theorems from 6 Lean blueprint projects. Empirical results show a substantial gap between RLMEval and traditional benchmarks, with the best model reaching only around 10% pass@128 on proof autoformalization in the more realistic normal mode, and modest gains when auxiliary lemmas are available. The findings highlight the importance of improved context handling, lemma discovery, and retrieval-augmented strategies for progress in formal mathematics, and RLMEval offers an extensible, multi-version framework to drive and benchmark such advances.

Abstract

Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

RLMEval: Evaluating Research-Level Neural Theorem Proving

TL;DR

RLMEval addresses a critical gap in evaluating automated reasoning for formal mathematics by focusing on research-level Lean mathematics and blueprint theorems from real-world projects. It defines two core tasks, Neural Theorem Proving and Proof Autoformalization, and provides Easy and Normal evaluation modes to simulate access to auxiliary lemmas, using a dataset of 613 theorems from 6 Lean blueprint projects. Empirical results show a substantial gap between RLMEval and traditional benchmarks, with the best model reaching only around 10% pass@128 on proof autoformalization in the more realistic normal mode, and modest gains when auxiliary lemmas are available. The findings highlight the importance of improved context handling, lemma discovery, and retrieval-augmented strategies for progress in formal mathematics, and RLMEval offers an extensible, multi-version framework to drive and benchmark such advances.

Abstract

Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

Paper Structure

This paper contains 10 sections, 3 theorems, 3 equations, 3 figures, 6 tables.

Key Result

Theorem 1

Given 'S : Solution', we have that '$\lambda$' does not divide 'S.Y'.

Figures (3)

  • Figure 1: Pass rate on RLMEval using pass@128 for neural theorem proving (left) and proof autoformalization (right), in Easy and Normal modes.
  • Figure 2: Scaling trends for the proof autoformalization task in normal mode on RLMEval for various models and pass@$k$ values.
  • Figure 3: Proof length distribution of the different evaluated models on RLMEval. Proof length is measured in lines of code, comments trimmed. Data aggregated from all experiments: NTP and proof autoformalization, normal and easy modes.

Theorems & Definitions (3)

  • Theorem
  • Theorem
  • Theorem