Table of Contents
Fetching ...

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

TL;DR

SorryDB is presented, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub that mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects.

Abstract

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

TL;DR

SorryDB is presented, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub that mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects.

Abstract

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.
Paper Structure (34 sections, 11 figures, 3 tables)

This paper contains 34 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Dataset extraction pipeline for SorryDB. We consider $78$ repositories on active GitHub Lean projects and list all the theorems that contain a sorry. We save the metadata for each of them and store it in the database. A project is considered active when it had at least one commit in the previous three months.
  • Figure 2: Success rate of different provers split by repository category. We compare general purpose LLMs, specialized models (pass@32), self-correcting (SC) and agentic approaches ($16$ iterations). We see that tasks from pedagogical repositories are easier to prove, while those from math formalization projects are harder. A specialized prover such as Goedel Prover works relatively better on benchmark repositories but has worse performance on other project types.
  • Figure 3: Analysis of prover complementarity. (a) Iterative approaches with error feedback strongly outperform parallel sampling with the same number of calls. (b) Different provers solve different subsets of problems, indicating complementarity.
  • Figure 4: Three attempts at filling the same sorry in the brownian-motion repository. (top) Goedel Prover outlines the proof structure but abandons each sub-goal with sorry, unable to find the required lemmas. (middle) Gemini with LeanSearch attempts a syntactically-correct guess using continuous_comp, a non-existent library function it assumed to exist. (bottom) Gemini self-correcting (without LeanSearch) inspects the definition of IsCadlag and manually constructs a proof term by providing the two required fields: right_continuous and left_limit.
  • Figure 5: Distribution of git blame dates in SorryDB-2601 and the 1000 sorry sample. Serving as a proxy for the time of addition, these dates reveal that most sorries are very recent.
  • ...and 6 more figures