Table of Contents
Fetching ...

FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization

Haijian Lu, Wei Wang, Jing Liu

Abstract

Autoformalization aims to translate natural-language mathematics into compilable, machine-checkable statements. However, semantic consistency does not imply prover effectiveness: even semantically consistent formalizations can differ substantially in proof-search cost and success rate. In this work, we formulate autoformalization as a budgeted, test-time search for semantically consistent repertoires, and propose FormalEvolve, a compilation-gated neuro-symbolic evolutionary framework. FormalEvolve generates diverse candidates via LLM-driven mutation and crossover with bounded patch repair, while symbolic Abstract Syntax Tree (AST) rewrite operations further inject structural diversity. On CombiBench and ProofNet, under a strict generator-call budget of T = 100, FormalEvolve reaches semantic hit rates (SH@100) of 58.0% and 84.9%, and reduces cross-problem concentration of semantic successes(lower Gini). Under a fixed prover budget, FormalEvolve also improves downstream proving performance on CombiBench. Code will be released publicly.

FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization

Abstract

Autoformalization aims to translate natural-language mathematics into compilable, machine-checkable statements. However, semantic consistency does not imply prover effectiveness: even semantically consistent formalizations can differ substantially in proof-search cost and success rate. In this work, we formulate autoformalization as a budgeted, test-time search for semantically consistent repertoires, and propose FormalEvolve, a compilation-gated neuro-symbolic evolutionary framework. FormalEvolve generates diverse candidates via LLM-driven mutation and crossover with bounded patch repair, while symbolic Abstract Syntax Tree (AST) rewrite operations further inject structural diversity. On CombiBench and ProofNet, under a strict generator-call budget of T = 100, FormalEvolve reaches semantic hit rates (SH@100) of 58.0% and 84.9%, and reduces cross-problem concentration of semantic successes(lower Gini). Under a fixed prover budget, FormalEvolve also improves downstream proving performance on CombiBench. Code will be released publicly.
Paper Structure (113 sections, 5 equations, 12 figures, 18 tables, 3 algorithms)

This paper contains 113 sections, 5 equations, 12 figures, 18 tables, 3 algorithms.

Figures (12)

  • Figure 1: Motivating illustration under a fixed prover and attempt budget ($N{=}64$). Even when candidate statements compile and are judged semantically consistent, prover outcomes can vary substantially. Constructing a diverse repertoire hedges against this sensitivity and increases the chance that at least one provable statement is found within budget.
  • Figure 2: Framework overview of FormalEvolve. A seed model produces an initial seedbank (debited as generator calls) that initializes a compilation-feasible archive. Candidates are stored with semantic scores and selected with usage penalties; a patch model proposes edits (full rewrite, diff patching, and cross patching) and triggers bounded repair or EvolAST fallbacks on compilation failures. Semantic scoring filters the archive into a diverse repertoire for downstream proving and evaluation.
  • Figure 3: Coverage and first-hit timing across two benchmarks ($t\le 100$).
  • Figure 4: Filtered per-problem semantic-success counts at $T=100$ under strict budget accounting (semantic_ok; $0$ shown as white). We keep problems where at least one method has a positive semantic_ok count and order columns by Compile+Semantic Repair to keep the ordering independent of ours.
  • Figure 5: Proof utility as a function of prover attempts $k$ on the semantically consistent repertoire produced within $T=100$ calls. Solid: pass@k; dashed: complete@k; bottom row shows the gap relative to Kimina Compile+Semantic Repair.
  • ...and 7 more figures