Table of Contents
Fetching ...

A Minimal Agent for Automated Theorem Proving

Borja Requena Pozo, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra

TL;DR

A minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures, and demonstrates consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness.

Abstract

We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices, and demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.

A Minimal Agent for Automated Theorem Proving

TL;DR

A minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures, and demonstrates consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness.

Abstract

We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices, and demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
Paper Structure (27 sections, 3 figures, 3 tables)

This paper contains 27 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Minimal agent for theorem proving. The design focuses on three main aspects: iterative proof refinement, memory system, and access to tools. A proposer agent writes Lean code to prove a given theorem. Then, a compiler verifies if the proposed proof works. If so, it is double-checked by a reviewer agent to prevent any form of cheating. If the code does not compile, or the reviewer objects, the feedback is sent to the memory module, and the cycle starts over with the proposer refining its previous proof. In addition, the proposer can be given access to tools such as library search or web search, and call them a fixed number of times before giving its proposal.
  • Figure 2: Ablation study on different components of the minimal agent. We compare single-shot parallel calls of a general purpose LLM with an iterative approach with feedback. We show the performance as a function of the number of sampled proofs as additional elements are added to the system. Starting by adding a feedback mechanism, we then evaluate two different memory mechanisms (history of previous attempts and a context managed through self-reflection), and we finally incorporate search tools. The entire experiment is conducted with Claude Opus 4.5 with a thinking budget of $10,000$ tokens. The pass@$k$ values are computed out of 50 independent samples, and we show the 95% confidence interval around them. For the iterative approach, we compute the mean percentage of proven theorems at each iteration and show the standard error of the mean around it. We use two samples for the different ablations, and three for the full system.
  • Figure 3: (a) Comparison between LLMs. Claude Sonnet 4.5 and Opus 4.5 have a thinking budget of $10,000$ tokens. Both Gemini 3 Flash and Pro have thinking set to "high". The pass@$k$ are computed from 50 independent proof samples for each model and we show the 95% confidence interval around the obtained value. For the full system, we show the mean performance over 3 full runs up to 50 iterations and show the standard error of the mean. (b) Cost-performance analysis. We compare the performance of different language models as we vary their thinking budget and compare their performance with respect to their cost. We consider thinking budgets of 2k, 10k and 32k tokens for both Claude models, and "low" and "high" for the Gemini models. We also test the "minimal" option for Gemini Flash.