Table of Contents
Fetching ...

Simplifying Formal Proof-Generating Models with ChatGPT and Basic Searching Techniques

Sangjun Han, Taeil Hur, Youngmi Hur, Kathy Sangkyung Lee, Myungyoon Lee, Hyojae Lim

TL;DR

The paper tackles AI-assisted formal proof generation by pairing a publicly accessible language model with a formal prover, Lean, using simple search strategies to prove problems in miniF2F. It introduces two proof-search regimes, $b$-search and $d$-search, plus a Bad($O$) feedback mechanism, and demonstrates competitive performance without any fine-tuning, including a peak of $pass@\$k\$ = 31.15\$ on miniF2F Lean when combining the two approaches. Cross-dataset and cross-LLM evaluations (ProofNet, AMC 12, and Llemma variants) show robustness, with results such as 13.14% on ProofNet and 28.28% on Llemma-based combos, indicating broader applicability beyond miniF2F. Overall, the work provides a practical, reproducible pathway toward AI-assisted formal proofs and motivates further research on integrating LLMs with formal proof systems.

Abstract

The challenge of formal proof generation has a rich history, but with modern techniques, we may finally be at the stage of making actual progress in real-life mathematical problems. This paper explores the integration of ChatGPT and basic searching techniques to simplify generating formal proofs, with a particular focus on the miniF2F dataset. We demonstrate how combining a large language model like ChatGPT with a formal language such as Lean, which has the added advantage of being verifiable, enhances the efficiency and accessibility of formal proof generation. Despite its simplicity, our best-performing Lean-based model surpasses all known benchmarks with a 31.15% pass rate. We extend our experiments to include other datasets and employ alternative language models, showcasing our models' comparable performance in diverse settings and allowing for a more nuanced analysis of our results. Our findings offer insights into AI-assisted formal proof generation, suggesting a promising direction for future research in formal mathematical proof.

Simplifying Formal Proof-Generating Models with ChatGPT and Basic Searching Techniques

TL;DR

The paper tackles AI-assisted formal proof generation by pairing a publicly accessible language model with a formal prover, Lean, using simple search strategies to prove problems in miniF2F. It introduces two proof-search regimes, -search and -search, plus a Bad() feedback mechanism, and demonstrates competitive performance without any fine-tuning, including a peak of k\ on miniF2F Lean when combining the two approaches. Cross-dataset and cross-LLM evaluations (ProofNet, AMC 12, and Llemma variants) show robustness, with results such as 13.14% on ProofNet and 28.28% on Llemma-based combos, indicating broader applicability beyond miniF2F. Overall, the work provides a practical, reproducible pathway toward AI-assisted formal proofs and motivates further research on integrating LLMs with formal proof systems.

Abstract

The challenge of formal proof generation has a rich history, but with modern techniques, we may finally be at the stage of making actual progress in real-life mathematical problems. This paper explores the integration of ChatGPT and basic searching techniques to simplify generating formal proofs, with a particular focus on the miniF2F dataset. We demonstrate how combining a large language model like ChatGPT with a formal language such as Lean, which has the added advantage of being verifiable, enhances the efficiency and accessibility of formal proof generation. Despite its simplicity, our best-performing Lean-based model surpasses all known benchmarks with a 31.15% pass rate. We extend our experiments to include other datasets and employ alternative language models, showcasing our models' comparable performance in diverse settings and allowing for a more nuanced analysis of our results. Our findings offer insights into AI-assisted formal proof generation, suggesting a promising direction for future research in formal mathematical proof.

Paper Structure

This paper contains 18 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Examples considering numbers of attempts taken by dChatLean+
  • Figure 2: The process of generating a proof for "mathd_algebra_171" using the Bad(O)-based feedback algorithm. The highlighted lines indicate the trials selected to be further illustrated in Figure \ref{['fig:case_study_2_2']}. The final proof is given in green, while other selected trials are in blue.
  • Figure 3: Selected proof trials highlighted in blue, also displayed in Figure \ref{['fig:case_study_2_1']}. The final proof is highlighted in green.
  • Figure 4: The proof generated by bChatLean for Bernoulli's inequality