Towards a Mathematics Formalisation Assistant using Large Language Models

Ayush Agrawal; Siddhartha Gadgil; Navin Goyal; Ashvni Narayanan; Anand Tadipatri

Towards a Mathematics Formalisation Assistant using Large Language Models

Ayush Agrawal, Siddhartha Gadgil, Navin Goyal, Ashvni Narayanan, Anand Tadipatri

TL;DR

This work investigates using Codex, an LLM, as a Mathematics Formalisation Assistant for Lean by combining input-dependent prompting and postprocessing to autoformalise NL theorems and, to a lesser extent, NL proofs. The authors demonstrate that theorem statements can be translated with notable accuracy on a $120$-theorem benchmark, while proofs are tackled via a first, design-sensitive study yielding repairable, partially correct outputs on a curated set of $13$ theorems. A novel prompting strategy—leveraging mathlib-style docstrings, sentence similarity, and keyword retrieval—paired with Lean 4 elaboration and parsing, underpins the improvements in theorem statement translation. The results suggest a promising path toward partially or fully automated formalisation in Lean, while highlighting data limitations and the need for further advances in prompt design, data collection, and integration with automated proof search and human-in-the-loop repair.

Abstract

Mathematics formalisation is the task of writing mathematics (i.e., definitions, theorem statements, proofs) in natural language, as found in books and papers, into a formal language that can then be checked for correctness by a program. It is a thriving activity today, however formalisation remains cumbersome. In this paper, we explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover. We find that with careful input-dependent prompt selection and postprocessing, Codex is able to formalise short mathematical statements at undergrad level with nearly 75\% accuracy for $120$ theorem statements. For proofs quantitative analysis is infeasible and we undertake a detailed case study. We choose a diverse set of $13$ theorems at undergrad level with proofs that fit in two-three paragraphs. We show that with a new prompting strategy Codex can formalise these proofs in natural language with at least one out of twelve Codex completion being easy to repair into a complete proof. This is surprising as essentially no aligned data exists for formalised mathematics, particularly for proofs. These results suggest that large language models are a promising avenue towards fully or partially automating formalisation.

Towards a Mathematics Formalisation Assistant using Large Language Models

TL;DR

-theorem benchmark, while proofs are tackled via a first, design-sensitive study yielding repairable, partially correct outputs on a curated set of

theorems. A novel prompting strategy—leveraging mathlib-style docstrings, sentence similarity, and keyword retrieval—paired with Lean 4 elaboration and parsing, underpins the improvements in theorem statement translation. The results suggest a promising path toward partially or fully automated formalisation in Lean, while highlighting data limitations and the need for further advances in prompt design, data collection, and integration with automated proof search and human-in-the-loop repair.

Abstract

theorem statements. For proofs quantitative analysis is infeasible and we undertake a detailed case study. We choose a diverse set of

theorems at undergrad level with proofs that fit in two-three paragraphs. We show that with a new prompting strategy Codex can formalise these proofs in natural language with at least one out of twelve Codex completion being easy to repair into a complete proof. This is surprising as essentially no aligned data exists for formalised mathematics, particularly for proofs. These results suggest that large language models are a promising avenue towards fully or partially automating formalisation.

Paper Structure (48 sections, 19 equations, 11 figures, 18 tables)

This paper contains 48 sections, 19 equations, 11 figures, 18 tables.

Introduction
Background and related work
Autoformalising theorem statements
Evaluation datasets
Techniques
Results
Autoformalisation of Proofs
Methodology
Results
Limitations and future work
The Lean prover and reasoning styles
Theorem Statement Translation : Further Details
Full Example prompt
Parsing, translation and auto-correction
Selection
...and 33 more sections

Figures (11)

Figure 1: Example of a prompt, the initial result and the result after post-processing. Part of the prompt was elided to save space; full prompt appears in Appendix \ref{['sec:full_theorem_prompt']}.
Figure 2: One of the examples in the prompt for the full-proof-with-comments format
Figure 3: Correction of a Codex completion of Absolute Value Function is Convex. The text highlighted in red is to be deleted and the text highlighted in green and underlined is to be added.
Figure 4: Four proofs of $3 < 7$ in Lean
Figure 5: Complete prompt used in Figure \ref{['codex-example']}.
...and 6 more figures

Towards a Mathematics Formalisation Assistant using Large Language Models

TL;DR

Abstract

Towards a Mathematics Formalisation Assistant using Large Language Models

Authors

TL;DR

Abstract

Table of Contents

Figures (11)