Table of Contents
Fetching ...

miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

TL;DR

The paper analyzes the miniF2F benchmark through the lens of an end-to-end AI pipeline that reads informal problems, formalizes them in Lean, and proves them. It identifies major misalignments between informal and formal statements and introduces miniF2F-v2 with corrected informal/formal statements and proofs, reporting end-to-end accuracy of about $70\%$ on miniF2F-v2 versus $40\%$ on miniF2F-v1. It also reveals that autoformalization metrics reported in the literature overstate true correctness and highlights the impact of benchmark quality on evaluating ATP and autoformalization progress. The work argues for high-quality, discrepancy-free benchmarks and provides a public, MIT-licensed dataset to guide future advances in formal reasoning for Olympiad-style problems.

Abstract

We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system that is tasked to participate in a math Olympiad consisting of the problems in miniF2F. In such setting, the model has to read and comprehend the problems in natural language, formalize them in Lean language, then proceed with proving the problems, and it will get credit for each problem if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such pipeline can be about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies, 97% and 69% reported in the autoformalization and theorem proving literature. Analyzing the failure modes, we trace back a considerable portion of this drop to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We proceed with correcting all the errors, discrepancies and simplifications in formal and informal statements, and present the miniF2F-v2 with fully verified formal and informal statements and proofs. Evaluating the full theorem proving pipeline on miniF2F-v2 leads to the best accuracy of 70%, a significant improvement from the 40% on the original miniF2F, yet indicating considerable misalignment between the autoformalization models and theorem provers. Our deep analysis suggests that a higher quality benchmark can help the community better evaluate progress in the field of formal reasoning and also better diagnose the failure and success modes of autoformalization and theorem proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.

miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

TL;DR

The paper analyzes the miniF2F benchmark through the lens of an end-to-end AI pipeline that reads informal problems, formalizes them in Lean, and proves them. It identifies major misalignments between informal and formal statements and introduces miniF2F-v2 with corrected informal/formal statements and proofs, reporting end-to-end accuracy of about on miniF2F-v2 versus on miniF2F-v1. It also reveals that autoformalization metrics reported in the literature overstate true correctness and highlights the impact of benchmark quality on evaluating ATP and autoformalization progress. The work argues for high-quality, discrepancy-free benchmarks and provides a public, MIT-licensed dataset to guide future advances in formal reasoning for Olympiad-style problems.

Abstract

We perform a thorough analysis of the formal and informal statements in the miniF2F benchmark from the perspective of an AI system that is tasked to participate in a math Olympiad consisting of the problems in miniF2F. In such setting, the model has to read and comprehend the problems in natural language, formalize them in Lean language, then proceed with proving the problems, and it will get credit for each problem if the formal proof corresponds to the original informal statement presented to the model. Our evaluation results reveal that the best accuracy of such pipeline can be about 36% using the SoTA models in the literature, considerably lower than the individual SoTA accuracies, 97% and 69% reported in the autoformalization and theorem proving literature. Analyzing the failure modes, we trace back a considerable portion of this drop to discrepancies between the formal and informal statements for more than half of the problems in miniF2F. We proceed with correcting all the errors, discrepancies and simplifications in formal and informal statements, and present the miniF2F-v2 with fully verified formal and informal statements and proofs. Evaluating the full theorem proving pipeline on miniF2F-v2 leads to the best accuracy of 70%, a significant improvement from the 40% on the original miniF2F, yet indicating considerable misalignment between the autoformalization models and theorem provers. Our deep analysis suggests that a higher quality benchmark can help the community better evaluate progress in the field of formal reasoning and also better diagnose the failure and success modes of autoformalization and theorem proving models. Our dataset is available at https://github.com/roozbeh-yz/miniF2F_v2.

Paper Structure

This paper contains 31 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: High-level overview of a formal math prover system operating in a Math Olympiad setting.
  • Figure 2: Correction example of $\text{amc12b\_2021\_p3}$ problem in miniF2F across versions 1, 2s and 2c.
  • Figure 3: Distribution of solved Olympiad-level (IMO, AMC, AIME) competition problems present in miniF2F-v1/2s/2c across four theorem provers. Theorem prover names were shortened from their original names. Each bar plot also shows the total number of problems present in the dataset.
  • Figure 4: Pie charts of identified formal and informal statement errors within miniF2F benchmark across test and validation sets.