Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Zenan Li; Ziran Yang; Deyuan; He; Haoyu Zhao; Andrew Zhao; Shange Tang; Kaiyu Yang; Aarti Gupta; Zhendong Su; Chi Jin

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Zenan Li, Ziran Yang, Deyuan, He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, Chi Jin

Abstract

Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean~4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0\% prove success rate, a 2.6$\times$ improvement over the strongest baseline, surpassing neural provers up to 84$\times$ larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Abstract

improvement over the strongest baseline, surpassing neural provers up to 84

larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.

Paper Structure (34 sections, 5 equations, 12 figures, 3 tables, 2 algorithms)

This paper contains 34 sections, 5 equations, 12 figures, 3 tables, 2 algorithms.

Introduction
Lean-based Code Verification
Problem Formulation
Verification Paradigms and Why Lean
Key Challenges
Hierarchical Verification Framework
Score for Lemma Decomposition
Unified Policy Learning and Inference
Training Pipeline
Inference Pipeline
Experiments
Experimental Setup
RQ1: Overall Effectiveness
RQ2: Inference-Time Scaling
RQ3: Decomposition Analysis
...and 19 more sections

Figures (12)

Figure 1: A code verification task in Lean 4. The blue block defines the specification: a precondition constraining valid inputs and a postcondition characterizing correct outputs. The red block states the top-level verification goal, where marks the proof obligation to be synthesized automatically. The complete proof synthesized by our framework is shown in Appendix \ref{['app:running-example']}.
Figure 2: Overview of the hierarchical proof search framework.Stage 1 (Recursive Lemma Decomposition): The verification goal is recursively decomposed into structurally simpler sub-lemmas. At each step, the LLM policy $\pi$ proposes candidates verified via proof reconstruction and quickcheck; the decomposition score $S$ ranks candidates by combining validity with structural reduction in operator footprint $d(\cdot)$. Stage 2 (Iterative Lemma Completion): Leaf lemmas are proved by the same policy through iterative tactic generation with Lean 4 compiler feedback. The same score serves as both the training reward and the inference-time ranking criterion.
Figure 3: Number of solved problems by baselines and our framework across three benchmarks. Baselines are evaluated with parallel generation under a Pass@128 budget; our method operates under a search-based inference setting using Göedel-Code-Prover-8B. Our framework outperforms all baselines by a substantial margin across every benchmark.
Figure 4: Decomposition reduction rate vs. iterations across three benchmarks. Lower values indicate more aggressive goal simplification.
Figure 5: Prove success rate vs. completion iterations under different pass@$k$ budgets.
...and 7 more figures

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Abstract

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Authors

Abstract

Table of Contents

Figures (12)