Table of Contents
Fetching ...

Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

Ido Pinto, Yizhak Yisrael Elboher, Haoze Wu, Nina Narodytska, Guy Katz

Abstract

The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset result in consistent and significant performance gain. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models; and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.

Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

Abstract

The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. While fine-tuning is a natural route to mitigate this limitation, obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset result in consistent and significant performance gain. In particular, a fine-tuned 4B parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models; and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
Paper Structure (30 sections, 2 theorems, 5 equations, 8 figures, 8 tables, 2 algorithms)

This paper contains 30 sections, 2 theorems, 5 equations, 8 figures, 8 tables, 2 algorithms.

Key Result

Proposition 4.1

For any predicate $\varphi$, the normalized predicate $\varphi' = \textsc{Normalize}(\varphi)$ is semantically equivalent to the original ($\varphi' \equiv \varphi$).

Figures (8)

  • Figure 1: Illustration of the Virtual Best Solver (VBS) metric relative to our verification procedure. Given a query $\langle A, P, q \rangle$, the LLM proposes a candidate property $I=(l, \varphi)$. The tasks run in parallel: $\mathcal{V}_1$ evaluates correctness (inductiveness) while $\mathcal{V}_2$ evaluates sufficiency. VBS captures the fastest conclusive path: if the decomposed checks yield a correct invariant with conclusive result (per Table \ref{['tab:decision']}), we record $\min(t_b, t_v)$; otherwise we revert to baseline time $t_b$.
  • Figure 2: Wonda pipeline. The raw verifier output ($V_0$) is normalized ($V_1$), then an LLM simplifies it to a compact closed-form expression ($V_2$): $5x + 3y = 300$, achieving a 2x verification speedup.
  • Figure 3: Inference-time invariant predictions on cohendiv_ll_valuebound50_6.c inner loop from the evaluation set across all stages. The progression demonstrates that the fine-tuning process works as expected where V2 successfully synthesizes the correct relationship $a \times y = b$. This V2 invariant drastically reduces the total verification runtime from 214.28s down to 5.39s, achieving a significant 39.75$\times$ speedup (including the LLM inference time).
  • Figure 4: Example of a raw verbose invariant generated by the verifier.
  • Figure 5: Invariant refinement pipeline. The LLM factors out the common constraint N <= 10 and reduces case enumeration into a compact range, achieving a 3.47x verification speedup.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Definition 3.1: Property
  • Definition 3.2: Verification Query
  • Definition 3.3: Verification Query Validity
  • Definition 3.4: Verification Oracle
  • Proposition 4.1: Normalization Soundness
  • proof
  • Definition 4.2: Quality Grade $Q$
  • Proposition 4.3: Pipeline Soundness
  • proof