Table of Contents
Fetching ...

Reliable Evaluation and Benchmarks for Statement Autoformalization

Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut

TL;DR

This work tackles the core challenge of evaluating statement autoformalization by introducing a CPU-friendly symbolic metric (BEq+) and validating it against human judgments. It also releases ProofNetVerif as a robust metric benchmark and two new benchmarks, ProofNet# and RLM25, to cover undergraduate and research-level content, respectively. Through systematic experiments across multiple models, the authors show that extensive sampling combined with type-check filtering substantially boosts autoformalization accuracy and highlight the importance of context in research-level tasks. These resources establish a reliable foundation for measuring progress and guiding future development in autoformalization systems.

Abstract

Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.

Reliable Evaluation and Benchmarks for Statement Autoformalization

TL;DR

This work tackles the core challenge of evaluating statement autoformalization by introducing a CPU-friendly symbolic metric (BEq+) and validating it against human judgments. It also releases ProofNetVerif as a robust metric benchmark and two new benchmarks, ProofNet# and RLM25, to cover undergraduate and research-level content, respectively. Through systematic experiments across multiple models, the authors show that extensive sampling combined with type-check filtering substantially boosts autoformalization accuracy and highlight the importance of context in research-level tasks. These resources establish a reliable foundation for measuring progress and guiding future development in autoformalization systems.

Abstract

Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
Paper Structure (44 sections, 7 figures, 15 tables, 1 algorithm)

This paper contains 44 sections, 7 figures, 15 tables, 1 algorithm.

Figures (7)

  • Figure 1: Evolution of BEq+ and BEq$_L$ pass@$n$ scores for different temperatures on top of type-check filtering. We evaluate Llemma 7B on ProofNet# validation split.
  • Figure 2: Evolution of BEq+ metric for different selection methods on top of type-check filtering. We evaluate Llama3 8B, Llemma 7B, and Llemma 34B on ProofNet# validation split with a number of candidate samples up to $n = 1000$. Given its quadratic scaling with $n$ and high computational cost, the symbolic equivalence method is limited to $n \leq 50$ candidate samples.
  • Figure 3: Type-Check rate and Accuracy scaling trends with respect to the number of samples on ProofNet# validation split using 12-shot prompting and a sampling-based method (type-check filter + Self-BLEU). The number of samples varies from n = 1 to 50 (where Llemma 34B has top type-check rate and GPT-4o top accuracy). Numbers are in \ref{['tab:results_sampling_scaling']}.
  • Figure 4: Type-Check Filtering Ablation Study. Accuracy scores are reported on ProofNet# validation split for various ablations of sampling-based methods. More details and exact numbers are reported in \ref{['tab:results_ablation_study']}.
  • Figure 5: BEq+ pass@n scaling trends with respect to the number of samples on ProofNet# validation split and RLM25 using in-context learning prompting. We vary the number of candidate samples from n = 1 to 50.
  • ...and 2 more figures