Table of Contents
Fetching ...

Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean

Peiyang Song, Kaiyu Yang, Anima Anandkumar

TL;DR

The paper tackles the challenge of autonomous theorem proving in Lean by integrating large language models directly into the Lean environment as copilots. It introduces Lean Copilot, a framework that runs LLM inference natively in Lean via FFI, offering three tools—tactic suggestion, proof search, and premise selection—with support for user-supplied models and local or cloud deployment. Experimental results on Mathematics in Lean show substantial gains: autonomous proof search covers 74.2% of proof steps (85% better than the baseline aesop), and human-assisted proofs require only 2.08 manually entered tactics (versus 3.86 for aesop). The work provides open-source tooling and a foundation for broader neuro-symbolic reasoning in formal mathematics, enabling rapid prototyping and adoption of LLM-based proof automation within proof assistants.

Abstract

Neural theorem proving combines large language models (LLMs) with proof assistants such as Lean, where the correctness of formal proofs can be rigorously verified, leaving no room for hallucination. With existing neural theorem provers pretrained on a fixed collection of data and offering valuable suggestions at times, it is challenging for them to continually prove novel theorems in a fully autonomous mode, where human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a general framework for running LLM inference natively in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Lean users can use our pretrained models or bring their own ones that run either locally (with or without GPUs) or on the cloud. Using Lean Copilot, we build LLM-based tools that suggest proof steps, complete proof goals, and select relevant premises. Experimental results on the Mathematics in Lean textbook demonstrate the effectiveness of our method compared to existing rule-based proof automation in Lean (aesop). When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (3.86 required by aesop); when automating the theorem proving process, Lean Copilot automates 74.2% proof steps on average, 85% better than aesop (40.1%). We open source all code and artifacts under a permissive MIT license to facilitate further research.

Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean

TL;DR

The paper tackles the challenge of autonomous theorem proving in Lean by integrating large language models directly into the Lean environment as copilots. It introduces Lean Copilot, a framework that runs LLM inference natively in Lean via FFI, offering three tools—tactic suggestion, proof search, and premise selection—with support for user-supplied models and local or cloud deployment. Experimental results on Mathematics in Lean show substantial gains: autonomous proof search covers 74.2% of proof steps (85% better than the baseline aesop), and human-assisted proofs require only 2.08 manually entered tactics (versus 3.86 for aesop). The work provides open-source tooling and a foundation for broader neuro-symbolic reasoning in formal mathematics, enabling rapid prototyping and adoption of LLM-based proof automation within proof assistants.

Abstract

Neural theorem proving combines large language models (LLMs) with proof assistants such as Lean, where the correctness of formal proofs can be rigorously verified, leaving no room for hallucination. With existing neural theorem provers pretrained on a fixed collection of data and offering valuable suggestions at times, it is challenging for them to continually prove novel theorems in a fully autonomous mode, where human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a general framework for running LLM inference natively in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Lean users can use our pretrained models or bring their own ones that run either locally (with or without GPUs) or on the cloud. Using Lean Copilot, we build LLM-based tools that suggest proof steps, complete proof goals, and select relevant premises. Experimental results on the Mathematics in Lean textbook demonstrate the effectiveness of our method compared to existing rule-based proof automation in Lean (aesop). When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (3.86 required by aesop); when automating the theorem proving process, Lean Copilot automates 74.2% proof steps on average, 85% better than aesop (40.1%). We open source all code and artifacts under a permissive MIT license to facilitate further research.
Paper Structure (28 sections, 6 figures, 1 table)

This paper contains 28 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Lean Copilot provides a general framework for running LLM inference in Lean, either locally via CTranslate2 or on a server. This framework enables creating various proof automation tools, including those for tactic suggestion, proof search, and premise selection.
  • Figure 2: The frontend of Lean Copilot's suggest_tactics. The user imports Lean Copilot just as a regular Lean package, and uses suggest_tactics in a proof, which calls an LLM to generate tactic candidates. After filtering, tactics that transform the proof goals errorlessly (i.e. "tactic states" in Lean) are shown in the InfoView (the view on the right, a vital part in Lean's workflow). The one that finishes the proof alone is colored green; others shown in blue with their respective remaining goals.
  • Figure 3: The frontend of Lean Copilot's search_proof. When search_proof succeeds, the found proof is displayed in the InfoView. It is a complete proof that is able to reduce the remaining goals to "No goals".
  • Figure 4: Two examples of annotated premises. The premises are selected based on proof goals, and then annotated according to whether they are in scope (i.e., all necessary modules imported, ready to be used) or not. The top one is in scope, annotated with its type information and docstring; the bottom one is out of scope, annotated with the module name that needs importing to use the premise, along with its complete definition code.
  • Figure 5: We provide an interface for text-to-text generation. Users can specify a model to use and configure it for generation. The model is not fixed after initial definition. If users would like to change part of its configuration, they can easily change certain the corresponding fields without the need to declare a new one. In this example, the user defines $model_1'$ by changing the number of return sequences in $model_1$ from 1 to 4. As a result, in the InfoView, $4$ sequences are returned for the #eval statement when $model_1'$ is used for generation.
  • ...and 1 more figures