WybeCoder: Verified Imperative Code Generation

Fabian Gloeckle, Mantas Baksys, Darius Feher, Kunhao Zheng, Amaury Hayat, Sean B. Holden, Gabriel Synnaeve, Peter O'Hearn

Abstract

Recent progress in large language models (LLMs) has advanced automatic code generation and formal theorem proving, yet software verification has not seen the same improvement. To address this gap, we propose WybeCoder, an agentic code verification framework that enables prove-as-you-generate development in which code, invariants, and proofs co-evolve. It builds on a recent framework that combines automatic verification condition generation and SMT solvers with interactive proofs in Lean. To enable systematic evaluation, we translate two benchmarks for functional verification in Lean, Verina and Clever, into equivalent imperative code specifications. On complex algorithms such as Heapsort, we observe consistent performance improvements from scaling our approach: it synthesizes dozens of valid invariants and dispatches dozens of subgoals, resulting in hundreds of lines of verified code and overcoming plateaus reported in previous work. Our best system solves 74% of Verina tasks and 62% of Clever tasks at moderate compute budgets, significantly surpassing previous evaluations and paving a path toward the automated construction of large-scale datasets of verified imperative code.
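
As a point of reference for the kind of Lean-level proof obligations involved, the toy sketch below shows a purely functional specification and proof in plain Lean 4. It is illustrative only: the function `myMax` and its specification theorem are hypothetical and do not use WybeCoder or the imperative verification framework it builds on.

    -- Hypothetical toy example in plain Lean 4 (not from the paper): a function
    -- together with a specification theorem of the flavor posed by Lean
    -- verification benchmarks such as Verina and Clever.
    def myMax (a b : Nat) : Nat :=
      if a ≤ b then b else a

    -- The returned value is an upper bound on both arguments.
    theorem myMax_spec (a b : Nat) : a ≤ myMax a b ∧ b ≤ myMax a b := by
      unfold myMax
      by_cases h : a ≤ b <;> simp [h] <;> omega

In WybeCoder's setting, the analogous obligations arise as verification conditions over imperative code with loop invariants; those that can be discharged automatically are handed to CVC5, and the remaining goals are proved interactively in Lean.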

Paper Structure

This paper contains 48 sections, 2 equations, 9 figures, 3 tables, and 3 algorithms.

Figures (9)

  • Figure 1: WybeCoder: Subgoal decomposition multi-agent system. Starting from a problem specification, the agent generates an implementation and attempts to discharge verification conditions using CVC5. Remaining goals are tackled interactively in Lean, either driving iterative refinement of the implementation or completing the proof.
  • Figure 2: Inference scaling for sequential agents and multi-agent system on Verina. Different pass@k settings entail different compute-performance scaling behaviors for sequential agents, which we compare to a multi-agent system. The optimal mechanism depends on the specific model in use, with Gemini-3 benefiting from the multi-agent scaffold while Claude 4.5 Opus does not.
  • Figure 3: Sequential vs Parallel Compute. Given a maximum inference budget of up to $C$ language model calls, we compute the optimal breakdown into $k \leq 32$ independent attempts of at most $T \leq 32$ turns each, subject to $kT \leq C$. Optimal $T$ ranges between 6 and 16 turns; increases beyond that are an artifact of bounded $k$ (Figure \ref{fig:optimal}). A small sketch enumerating these feasible splits appears after this figure list.
  • Figure 4: Multiple attempts with multi-agent systems.
  • Figure 5: Inference scaling comparison for multi-agent system with different models. We evaluate using up to 128 subagents on Verina and plot against the total number of LLM calls spent according to iterative deepening (Section \ref{app:tradeoff}). See Figure \ref{fig:decomp-scale} for comparisons in terms of latency and on the Clever-Loom benchmark.
  • ...and 4 more figures
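
To make the budget arithmetic behind Figure 3 concrete, the sketch below (hypothetical, not the paper's implementation) enumerates the admissible splits of a call budget $C$ into $k$ independent attempts of at most $T$ turns each, under the constraints $k \leq 32$, $T \leq 32$, and $kT \leq C$; the paper then reports the breakdown that performs best.

    -- Hypothetical sketch (not the paper's code): list all (k, T) splits of a
    -- call budget C into k independent attempts of at most T turns each,
    -- subject to k ≤ 32, T ≤ 32, and k * T ≤ C.
    def feasibleSplits (C : Nat) : Array (Nat × Nat) := Id.run do
      let mut out : Array (Nat × Nat) := #[]
      for k in [1:33] do      -- number of independent attempts, 1..32
        for T in [1:33] do    -- maximum turns per attempt, 1..32
          if k * T ≤ C then
            out := out.push (k, T)
      return out

    -- Example: count the admissible splits of a 64-call budget.
    #eval (feasibleSplits 64).size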