LLMSTEP: LLM proofstep suggestions in Lean

Sean Welleck; Rahul Saha

LLMSTEP: LLM proofstep suggestions in Lean

Sean Welleck, Rahul Saha

TL;DR

The paper presents llmstep, a Lean 4 tactic that delegates proof-step suggestions to a language-model server and validates outputs within Lean. It introduces a baseline, fine-tuned Pythia 2.8b model (and compatibility with ReProver) and provides end-to-end implementation across CPU, GPU, and Colab runtimes, emphasizing a model-agnostic, prefix-based workflow. Through a rigorous evaluation using best-first search on Lean Dojo and miniF2F benchmarks, the approach demonstrates competitive proof-search performance and practical runtimes, highlighting sub-second GPU latency and viable CPU performance for smaller models. The work aims to lower barriers to LM-assisted formalization and lays groundwork for future faster inference and broader tactic-prediction capabilities within Lean 4 editor integrations.

Abstract

We present LLMSTEP, a tool for integrating a language model into the Lean proof assistant. LLMSTEP is a Lean 4 tactic that sends a user's proof state to a server hosting a language model. The language model generates suggestions, which are checked in Lean and displayed to a user in their development environment. We provide a baseline language model, along with code for fine-tuning and evaluation to support further development. We provide server implementations that run on CPU, a CUDA GPU, or a Google Colab notebook, as a step towards fast, effective language model suggestions for any user.

LLMSTEP: LLM proofstep suggestions in Lean

TL;DR

Abstract

Paper Structure (9 sections, 3 figures, 2 tables)

This paper contains 9 sections, 3 figures, 2 tables.

Introduction
Related Work
Approach
Implementation
Lean tactic.
Language model.
Evaluation
Conclusion
Acknowledgements

Figures (3)

Figure 1: llmstep is a tool for receiving proofstep suggestions from a language model. A user calls llmstep in a proof (left), which sends the current state of the proof and a prefix to a server. A language model generates proofstep suggestions which are checked in Lean and shown to the user (right). llmstep supports a variety of compute environments, with servers that run on CPU, CUDA GPU, or Google Colab. llmstep is implemented as a Lean 4 metaprogram with Python servers.
Figure 2: Runtime comparison of different compute environments and models supported by llmstep.
Figure 3: In this example, the user inputs an empty prefix. llmstep returns five suggestions. The first three suggestions each finish the proof, and the last two are valid.

LLMSTEP: LLM proofstep suggestions in Lean

TL;DR

Abstract

LLMSTEP: LLM proofstep suggestions in Lean

Authors

TL;DR

Abstract

Table of Contents

Figures (3)