Table of Contents
Fetching ...

LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover

Zijian Wu, Jiayu Wang, Dahua Lin, Kai Chen

TL;DR

Lean-GitHub addresses the scarcity of formal theorem proving data by mining Lean 4 code from GitHub to create LEAN-GitHub. The authors train InternLM2-StepProver on this data and show state-of-the-art performance on Lean 4 benchmarks (miniF2F, ProofNet, Putnam). They detail a scalable pipeline including improved data extraction and de-duplication during tree search, with ablations confirming the value of human written Lean data beyond Mathlib. The work open-sources LEAN-GitHub and the trained model, enabling broader exploration of automated formal reasoning across diverse mathematical topics.

Abstract

Recently, large language models have presented promising results in aiding formal mathematical reasoning. However, their performance is restricted due to the scarcity of formal theorem-proving data, which requires additional effort to be extracted from raw formal language corpora. Meanwhile, a significant amount of human-written formal language corpora remains underutilized. To address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from almost all Lean 4 repositories on GitHub. After fine-tuning InternLM-math-plus on this dataset, our model achieved accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing state-of-the-art method at 52%. And it also achieves state-of-the-art on two other Lean 4 benchmarks (ProofNet and Putnam) targeting different fields/levels of math. These results demonstrate that our proposed dataset is beneficial for formal reasoning on a wide range of math topics. We open-source our model at https://GitHub. com/InternLM/InternLM-Math and our data at https://huggingface.co/ datasets/InternLM/Lean-GitHub

LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover

TL;DR

Lean-GitHub addresses the scarcity of formal theorem proving data by mining Lean 4 code from GitHub to create LEAN-GitHub. The authors train InternLM2-StepProver on this data and show state-of-the-art performance on Lean 4 benchmarks (miniF2F, ProofNet, Putnam). They detail a scalable pipeline including improved data extraction and de-duplication during tree search, with ablations confirming the value of human written Lean data beyond Mathlib. The work open-sources LEAN-GitHub and the trained model, enabling broader exploration of automated formal reasoning across diverse mathematical topics.

Abstract

Recently, large language models have presented promising results in aiding formal mathematical reasoning. However, their performance is restricted due to the scarcity of formal theorem-proving data, which requires additional effort to be extracted from raw formal language corpora. Meanwhile, a significant amount of human-written formal language corpora remains underutilized. To address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from almost all Lean 4 repositories on GitHub. After fine-tuning InternLM-math-plus on this dataset, our model achieved accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing state-of-the-art method at 52%. And it also achieves state-of-the-art on two other Lean 4 benchmarks (ProofNet and Putnam) targeting different fields/levels of math. These results demonstrate that our proposed dataset is beneficial for formal reasoning on a wide range of math topics. We open-source our model at https://GitHub. com/InternLM/InternLM-Math and our data at https://huggingface.co/ datasets/InternLM/Lean-GitHub
Paper Structure (13 sections, 1 equation, 8 figures, 5 tables)

This paper contains 13 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An IMO problem solved by InternLM2-StepProver .
  • Figure 2: The pipeline for constructing LEAN-GitHub.
  • Figure 3: Word Cloud for the theorem names in LEAN-GitHub.
  • Figure 4: Top 30 repositories with most theorems extracted.
  • Figure 5: Examples of (input, output) pairs of our training prompt.
  • ...and 3 more figures