Table of Contents
Fetching ...

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang

TL;DR

The paper addresses the scarcity of NL-FL aligned data for Lean4 theorem proving and presents TheoremLlama, an end-to-end framework that turns a general-purpose LLM into a Lean4 expert through NL-FL data generation (Open Bootstrapped Theorems, OBT), NL-FL bootstrapping, and targeted training. It combines block training, curriculum data sorting, and instruction fine-tuning with iterative proof writing to leverage NL reasoning for formal Lean4 proofs. Empirical results on MiniF2F-Lean4 show TheoremLlama achieving $36.48\%$ (Valid) and $33.61\%$ (Test), outperforming GPT-4-Turbo and Gemini baselines and surpassing tree-search methods, with ablation studies confirming the contribution of each component. The work provides a practical path to democratize Lean4 theorem proving with LLMs and opens datasets, model checkpoints, and code to the research community.

Abstract

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. However, due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data most modern LLMs exhibit suboptimal performance.This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address these challenges, this paper proposes TheoremLlama, an end-to-end framework that trains a general-purpose LLM to be a Lean4 expert. TheoremLlama includes NL-FL dataset generation and bootstrapping method to obtain aligned dataset, curriculum learning and block training techniques to train the model, and iterative proof writing method to write Lean4 proofs that work together synergistically. Using the dataset generation method in TheoremLlama, we provide Open Bootstrapped Theorems (OBT), an NL-FL aligned and bootstrapped dataset. Our novel NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leverages the NL reasoning ability of LLMs for formal reasoning. The TheoremLlama framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. Our code, model checkpoints, and the generated dataset is published in GitHub

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

TL;DR

The paper addresses the scarcity of NL-FL aligned data for Lean4 theorem proving and presents TheoremLlama, an end-to-end framework that turns a general-purpose LLM into a Lean4 expert through NL-FL data generation (Open Bootstrapped Theorems, OBT), NL-FL bootstrapping, and targeted training. It combines block training, curriculum data sorting, and instruction fine-tuning with iterative proof writing to leverage NL reasoning for formal Lean4 proofs. Empirical results on MiniF2F-Lean4 show TheoremLlama achieving (Valid) and (Test), outperforming GPT-4-Turbo and Gemini baselines and surpassing tree-search methods, with ablation studies confirming the contribution of each component. The work provides a practical path to democratize Lean4 theorem proving with LLMs and opens datasets, model checkpoints, and code to the research community.

Abstract

Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. However, due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data most modern LLMs exhibit suboptimal performance.This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address these challenges, this paper proposes TheoremLlama, an end-to-end framework that trains a general-purpose LLM to be a Lean4 expert. TheoremLlama includes NL-FL dataset generation and bootstrapping method to obtain aligned dataset, curriculum learning and block training techniques to train the model, and iterative proof writing method to write Lean4 proofs that work together synergistically. Using the dataset generation method in TheoremLlama, we provide Open Bootstrapped Theorems (OBT), an NL-FL aligned and bootstrapped dataset. Our novel NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leverages the NL reasoning ability of LLMs for formal reasoning. The TheoremLlama framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. Our code, model checkpoints, and the generated dataset is published in GitHub
Paper Structure (37 sections, 3 equations, 2 figures, 15 tables)

This paper contains 37 sections, 3 equations, 2 figures, 15 tables.

Figures (2)

  • Figure 1: TheoremLlama Framework: (a) NL-FL Aligned Data Generation: We first extract Lean4 data from Mathlib4. Subsequently, we fine-tune a T5 encoder to search for the best examples to guide the informalization of the extracted data. Then, we apply Gemini-1.5 to informalize extracted theorems with retrieved examples. Finally, we perform NL-FL Bootstraping to integrate natural language reasoning into Lean4 codes. Using this generation method, we have the OBT dataset. (b) Lean4 Prover Training: We use block training to enhance the in-context ability and the curriculum data sorting to let LLM learn from easy to hard data. These techniques can make LLM better learn unfamiliar Lean4 theorem proving tasks. (c) Iterative Proof Writing: We iteratively use the correct proofs from the same dataset of the previous iterations as in-context examples to enhance the proof-writing ability of the LLM.
  • Figure 2: Histogram for all combinations of NL statement in example list and FL statement in Mathlib4