An Evaluation Benchmark for Autoformalization in Lean4

Aryan Gulati; Devanshu Ladsaria; Shubhra Mishra; Jasdeep Sidhu; Brando Miranda

An Evaluation Benchmark for Autoformalization in Lean4

Aryan Gulati, Devanshu Ladsaria, Shubhra Mishra, Jasdeep Sidhu, Brando Miranda

TL;DR

The paper tackles the challenge of autoformalizing informal mathematics into Lean4 by introducing a new benchmark built from 101 theorem statements across 17 topics drawn from mathlib4. It evaluates three contemporary LLMs in a zero-shot setting using a $0$–$4$ correction-effort scoring, revealing that despite progress, current models still require substantial edits, with average efforts around $2.238$–$2.248$ and notable topic-dependent variability. GPT-4 and Gemini Pro show some improvements in producing correct local forms but also yield a higher share of severely incorrect outputs (score $4$) in harder domains, illustrating the limits of current models for complex autoformalization tasks. The work establishes a systematic, topic-aware evaluation framework that can guide future model development and integration with automated theorem proving in Lean4, aiming to close the gap between natural language formulations and fully verified formal proofs.

Abstract

Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization capabilities of LLMs. This paper introduces a novel evaluation benchmark designed for Lean4, applying it to test the abilities of state-of-the-art LLMs, including GPT-3.5, GPT-4, and Gemini Pro. Our comprehensive analysis reveals that, despite recent advancements, these LLMs still exhibit limitations in autoformalization, particularly in more complex areas of mathematics. These findings underscore the need for further development in LLMs to fully harness their potential in scientific research and development. This study not only benchmarks current LLM capabilities but also sets the stage for future enhancements in autoformalization.

An Evaluation Benchmark for Autoformalization in Lean4

TL;DR

–

correction-effort scoring, revealing that despite progress, current models still require substantial edits, with average efforts around

–

and notable topic-dependent variability. GPT-4 and Gemini Pro show some improvements in producing correct local forms but also yield a higher share of severely incorrect outputs (score

) in harder domains, illustrating the limits of current models for complex autoformalization tasks. The work establishes a systematic, topic-aware evaluation framework that can guide future model development and integration with automated theorem proving in Lean4, aiming to close the gap between natural language formulations and fully verified formal proofs.

Abstract

Paper Structure (7 sections, 1 figure, 1 table)

This paper contains 7 sections, 1 figure, 1 table.

Introduction
Methodology and Results
Conclusion
Dataset
Inside the Dataset
Sample Prompts and Outputs
Graphing the Correction Effort for Separate Models

Figures (1)

Figure 1: Average Correction Efforts Across Topics

An Evaluation Benchmark for Autoformalization in Lean4

TL;DR

Abstract

An Evaluation Benchmark for Autoformalization in Lean4

Authors

TL;DR

Abstract

Table of Contents

Figures (1)