Table of Contents
Fetching ...

An Evaluation Benchmark for Autoformalization in Lean4

Aryan Gulati, Devanshu Ladsaria, Shubhra Mishra, Jasdeep Sidhu, Brando Miranda

TL;DR

The paper tackles the challenge of autoformalizing informal mathematics into Lean4 by introducing a new benchmark built from 101 theorem statements across 17 topics drawn from mathlib4. It evaluates three contemporary LLMs in a zero-shot setting using a $0$–$4$ correction-effort scoring, revealing that despite progress, current models still require substantial edits, with average efforts around $2.238$–$2.248$ and notable topic-dependent variability. GPT-4 and Gemini Pro show some improvements in producing correct local forms but also yield a higher share of severely incorrect outputs (score $4$) in harder domains, illustrating the limits of current models for complex autoformalization tasks. The work establishes a systematic, topic-aware evaluation framework that can guide future model development and integration with automated theorem proving in Lean4, aiming to close the gap between natural language formulations and fully verified formal proofs.

Abstract

Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization capabilities of LLMs. This paper introduces a novel evaluation benchmark designed for Lean4, applying it to test the abilities of state-of-the-art LLMs, including GPT-3.5, GPT-4, and Gemini Pro. Our comprehensive analysis reveals that, despite recent advancements, these LLMs still exhibit limitations in autoformalization, particularly in more complex areas of mathematics. These findings underscore the need for further development in LLMs to fully harness their potential in scientific research and development. This study not only benchmarks current LLM capabilities but also sets the stage for future enhancements in autoformalization.

An Evaluation Benchmark for Autoformalization in Lean4

TL;DR

The paper tackles the challenge of autoformalizing informal mathematics into Lean4 by introducing a new benchmark built from 101 theorem statements across 17 topics drawn from mathlib4. It evaluates three contemporary LLMs in a zero-shot setting using a correction-effort scoring, revealing that despite progress, current models still require substantial edits, with average efforts around and notable topic-dependent variability. GPT-4 and Gemini Pro show some improvements in producing correct local forms but also yield a higher share of severely incorrect outputs (score ) in harder domains, illustrating the limits of current models for complex autoformalization tasks. The work establishes a systematic, topic-aware evaluation framework that can guide future model development and integration with automated theorem proving in Lean4, aiming to close the gap between natural language formulations and fully verified formal proofs.

Abstract

Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization capabilities of LLMs. This paper introduces a novel evaluation benchmark designed for Lean4, applying it to test the abilities of state-of-the-art LLMs, including GPT-3.5, GPT-4, and Gemini Pro. Our comprehensive analysis reveals that, despite recent advancements, these LLMs still exhibit limitations in autoformalization, particularly in more complex areas of mathematics. These findings underscore the need for further development in LLMs to fully harness their potential in scientific research and development. This study not only benchmarks current LLM capabilities but also sets the stage for future enhancements in autoformalization.
Paper Structure (7 sections, 1 figure, 1 table)

This paper contains 7 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Average Correction Efforts Across Topics