RLMEval: Evaluating Research-Level Neural Theorem Proving

Auguste Poiroux; Antoine Bosselut; Viktor Kunčak

RLMEval: Evaluating Research-Level Neural Theorem Proving

Auguste Poiroux, Antoine Bosselut, Viktor Kunčak

TL;DR

RLMEval addresses a critical gap in evaluating automated reasoning for formal mathematics by focusing on research-level Lean mathematics and blueprint theorems from real-world projects. It defines two core tasks, Neural Theorem Proving and Proof Autoformalization, and provides Easy and Normal evaluation modes to simulate access to auxiliary lemmas, using a dataset of 613 theorems from 6 Lean blueprint projects. Empirical results show a substantial gap between RLMEval and traditional benchmarks, with the best model reaching only around 10% pass@128 on proof autoformalization in the more realistic normal mode, and modest gains when auxiliary lemmas are available. The findings highlight the importance of improved context handling, lemma discovery, and retrieval-augmented strategies for progress in formal mathematics, and RLMEval offers an extensible, multi-version framework to drive and benchmark such advances.

Abstract

Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

RLMEval: Evaluating Research-Level Neural Theorem Proving

TL;DR

Abstract

RLMEval: Evaluating Research-Level Neural Theorem Proving

TL;DR

Abstract

Paper Structure

Table of Contents

Key Result

Figures (3)

Theorems & Definitions (3)