Table of Contents
Fetching ...

Hilbert: Recursively Building Formal Proofs with Informal Reasoning

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, Ke Ye

TL;DR

Hilbert introduces a hierarchical, retrieval-augmented framework that unites informal mathematical reasoning with formal Lean-based verification to close the gap between human-like reasoning and machine-verified proofs. By coordinating a Reasoner, Prover, Verifier, and Retriever, and employing recursive subgoal decomposition, it achieves state-of-the-art results on miniF2F ($$99.2\%$$) and PutnamBench ($$70.0\%$$) while maintaining efficiency gains through retrieval. The method demonstrates that hybridizing informal and formal approaches, with verifier feedback and iterative refinement, yields substantially more reliable proof generation than either paradigm alone. The work also highlights the potential of using reasoning traces to further train and improve both prover and reasoner models in a virtuous cycle of improvement.

Abstract

Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6% points above the best publicly available method. Hilbert achieves the best known result on PutnamBench. It solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.

Hilbert: Recursively Building Formal Proofs with Informal Reasoning

TL;DR

Hilbert introduces a hierarchical, retrieval-augmented framework that unites informal mathematical reasoning with formal Lean-based verification to close the gap between human-like reasoning and machine-verified proofs. By coordinating a Reasoner, Prover, Verifier, and Retriever, and employing recursive subgoal decomposition, it achieves state-of-the-art results on miniF2F () and PutnamBench () while maintaining efficiency gains through retrieval. The method demonstrates that hybridizing informal and formal approaches, with verifier feedback and iterative refinement, yields substantially more reliable proof generation than either paradigm alone. The work also highlights the potential of using reasoning traces to further train and improve both prover and reasoner models in a virtuous cycle of improvement.

Abstract

Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6% points above the best publicly available method. Hilbert achieves the best known result on PutnamBench. It solves 462/660 problems (70.0%), outperforming proprietary approaches like SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.

Paper Structure

This paper contains 21 sections, 9 figures, 3 tables, 8 algorithms.

Figures (9)

  • Figure 1: The Hilbert algorithm. Given a target theorem, Hilbert attempts formal proof generation with the prover. Upon failure, it decomposes the problem into subgoals and tries to solve them with the prover, followed by the reasoner (shallow solve). If both strategies fail, it resorts to recursive decomposition until all subgoals are resolved.
  • Figure 2: Subgoal Decomposition: Given a theorem statement, Hilbert: (1) retrieves relevant theorems from Mathlib using semantic search, (2) generates a formal proof sketch with subgoals marked as have statements with sorry placeholders, (3) extracts these subgoals as independent theorem statements, and (4) assembles the proof by replacing sorry placeholders with calls to the subgoal theorems. Verifiers ensure correctness at each stage. The error correction loops are indicated by dotted lines.
  • Figure 3: Pass rate (vs) Inference-time Budget. We plot the pass-rate for Hilbert on MiniF2F as a function of (left) the number of Reasoner calls (right) the total number of LLM (Reasoner + Prover) calls per sample.
  • Figure 4: Pass rate (vs) recursive depth $D$ on MiniF2F for Hilbert (Gemini 2.5 Pro) + Goedel-Prover-V2-32B
  • Figure 5: Subgoal Decomposition Example. We illustrate the subgoal decomposition process using the input theorem sqrt_ten_irrational. The process consists of four main steps: (1) We retrieve relevant theorems from Mathlib to inform the proof strategy. (2) The Reasoner generates a proof sketch, which is verified by the Lean Verifier for validity. If verification fails, error messages guide the Reasoner to make corrections. (3) The Reasoner extracts subgoals from the validated sketch and verifies their correctness, refining them as needed. (4) The Reasoner assembles a complete proof by incorporating the extracted subgoals into the original sketch. Since the subgoals lack proofs at this stage, they are denoted by sorry. This assembled proof undergoes final verification. The process outputs both the complete assembled proof and the verified subgoals (without their proofs). Note that while Steps (3) and (4) are shown together in this figure for simplicity, they represent distinct operations as detailed in Figure \ref{['fig:subgoal_decomp']}.
  • ...and 4 more figures