LLMSTEP: LLM proofstep suggestions in Lean
Sean Welleck, Rahul Saha
TL;DR
The paper presents llmstep, a Lean 4 tactic that sends the current proof state to a language-model server and checks the returned suggestions within Lean. It provides a baseline model, a fine-tuned Pythia 2.8b, along with compatibility with ReProver, and ships server implementations for CPU, GPU, and Colab runtimes. The workflow is model-agnostic and prefix-based: the user can supply the beginning of a tactic, and suggestions complete it. Evaluated with best-first search on the LeanDojo and miniF2F benchmarks, the approach shows competitive proof-search performance and practical runtimes, with sub-second latency on GPU and viable CPU performance for smaller models. The work aims to lower the barrier to LM-assisted formalization and lays groundwork for faster inference and broader tactic prediction in Lean 4 editor integrations.
Abstract
We present LLMSTEP, a tool for integrating a language model into the Lean proof assistant. LLMSTEP is a Lean 4 tactic that sends a user's proof state to a server hosting a language model. The language model generates suggestions, which are checked in Lean and displayed to a user in their development environment. We provide a baseline language model, along with code for fine-tuning and evaluation to support further development. We provide server implementations that run on CPU, a CUDA GPU, or a Google Colab notebook, as a step towards fast, effective language model suggestions for any user.
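The suggest-then-check flow described above can be illustrated with a minimal Python sketch. It is not the LLMSTEP implementation: the JSON field names (`tactic_state`, `prefix`, `suggestions`), the functions `handle_request` and `check_suggestions`, and the toy model and toy checker are all hypothetical stand-ins. In the actual tool, generation is done by a language model and checking is done by Lean.

```python
import json
from typing import Callable, List, Tuple

# A generator maps (proof state, prefix) to scored candidate tactics.
Generator = Callable[[str, str], List[Tuple[str, float]]]


def handle_request(body: str, generate: Generator) -> str:
    """Server side (hypothetical): parse a request, query the model,
    and return deduplicated suggestions ranked by score."""
    request = json.loads(body)
    state, prefix = request["tactic_state"], request["prefix"]
    ranked: List[str] = []
    seen = set()
    # Highest score first; keep only candidates that extend the prefix.
    for tactic, _score in sorted(generate(state, prefix),
                                 key=lambda c: c[1], reverse=True):
        if tactic.startswith(prefix) and tactic not in seen:
            seen.add(tactic)
            ranked.append(tactic)
    return json.dumps({"suggestions": ranked})


def check_suggestions(response: str,
                      check: Callable[[str], bool]) -> List[Tuple[str, bool]]:
    """Client side (hypothetical): pair each suggestion with the result
    of a checker. In the real tool, this check is performed by Lean."""
    suggestions = json.loads(response)["suggestions"]
    return [(tactic, check(tactic)) for tactic in suggestions]


def toy_generate(state: str, prefix: str) -> List[Tuple[str, float]]:
    """Stand-in for a language model: fixed candidates with log-prob scores."""
    return [("simp", -0.2), ("rfl", -0.5), ("simp", -0.9), ("intro h", -0.1)]


def toy_check(tactic: str) -> bool:
    """Stand-in for Lean: pretend only these tactics succeed."""
    return tactic in {"simp", "rfl"}


if __name__ == "__main__":
    body = json.dumps({"tactic_state": "n : Nat |- n + 0 = n", "prefix": ""})
    for tactic, ok in check_suggestions(handle_request(body, toy_generate),
                                        toy_check):
        print(("valid  " if ok else "invalid"), tactic)
```

Separating generation from checking mirrors the division of labor in the abstract: the server only proposes tactics, so any model that accepts a proof state and a prefix can be swapped in without changing the checking side.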
