SEVerA: Verified Synthesis of Self-Evolving Agents

Debangshu Banerjee, Changming Xu, Gagandeep Singh

Abstract

Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($τ^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.
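To make the FGGM mechanism concrete, here is a minimal illustrative sketch of the rejection-sampler-with-verified-fallback wrapper described above; the helper name `fggm_call`, its signature, and the sampling budget are assumptions for exposition, not the paper's API.

```python
from typing import Callable, TypeVar

X = TypeVar("X")  # input type of the GM call
Y = TypeVar("Y")  # output type of the GM call

def fggm_call(
    model: Callable[[X], Y],        # underlying parametric GM (e.g., an LLM call); hypothetical
    check: Callable[[X, Y], bool],  # decidable checker for the local contract Psi_l
    fallback: Callable[[X], Y],     # verified non-parametric fallback that always satisfies Psi_l
    x: X,
    max_samples: int = 8,           # assumed sampling budget
) -> Y:
    """Rejection sampling with a verified fallback: sample from the GM up to
    `max_samples` times, return the first output that passes the contract check,
    and otherwise return the fallback output. Because the fallback is verified,
    the returned value satisfies the contract for any input and any parameters."""
    for _ in range(max_samples):
        y = model(x)
        if check(x, y):
            return y
    return fallback(x)
```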


Paper Structure

This paper contains 73 sections, 11 theorems, 47 equations, 10 figures, 8 tables, 4 algorithms.

Key Result

Lemma 1 (Completeness of Checker)

For any valid FGGM $\mathsf{G}$ with quantifier-free $\Psi_{l}$, $\forall x_1 \in T_1, \dots, \forall x_n \in T_n, \forall y \in T_o.\; \Phi_{l}(x_1, \dots, x_n) \implies \left(\textcolor{funcpurple}{check_{\mathcal{A}, \Phi_{l}, \Psi_{l}}}(x_1, \dots, x_n, y) \iff \Psi_{l}(x_1, \dots, x_n, y)\right)$.
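For intuition only, a tiny hypothetical instance: when $\Psi_{l}$ is quantifier-free, the checker can simply evaluate the contract on the concrete arguments, which is what Lemma 1 licenses. The contract below (the output must divide a positive input) is invented for illustration and is not taken from the paper.

```python
def phi_l(x: int) -> bool:
    # hypothetical precondition: the input is a positive integer
    return x > 0

def psi_l(x: int, y: int) -> bool:
    # hypothetical quantifier-free postcondition: y lies in [1, x] and divides x
    return 1 <= y <= x and x % y == 0

def check(x: int, y: int) -> bool:
    # direct evaluation of the quantifier-free contract on concrete values;
    # per Lemma 1, this agrees with psi_l whenever phi_l holds
    return psi_l(x, y)

assert check(12, 4) and not check(12, 5)
```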

Figures (10)

  • Figure 1: Overview of SEVerA, which operates in three key steps. (1) Search: Given the task information, library functions $\mathcal{F}_{c}$, a list of parametric generative models (GMs) $\mathcal{F}_{p}$, and specifications $(\Phi, \Psi)$, the planner LLM outputs a parametric agentic program in which all formal output specifications of GM calls are defined using the FGGM setup. (2) Verify: Given the parametric agentic code, this step first verifies all FGGM definitions proposed by the planner. Once verified, it uses their local contracts to check the program against $(\Phi, \Psi)$. If all checks pass, the program is accepted and the unconstrained learning step is invoked on the verified program; otherwise, an error message is returned to the planner. (3) Learn: This step optimizes the parameters of the underlying GMs within each FGGM to improve conformance with the local contracts defined in step (1), while also reducing the task loss $\mathsf{L}$ over the dataset $\mathrm{D}$. After tuning, SEVerA maintains a pool of verified fine-tuned agents and uses their execution traces on $\mathrm{D}$ to generate new candidates via step (1). All agents in the pool are valid candidate solutions, and the one with the lowest $\mathsf{L}$ on $\mathrm{D}$ is returned. Central to SEVerA is the Formally Guarded Generative Model (FGGM), which binds each GM call to local input-output contracts $(\Phi_{l}, \Psi_{l})$ and a verified non-parametric fallback, ensuring that the contracts hold irrespective of the underlying GM parameters. This enables unconstrained gradient-based parameter optimization without compromising formal correctness. (A schematic sketch of this search-verify-learn loop appears after the figure list.)
  • Figure 2: Auto-synthesized guarded GM with rejection sampler (1) and fallback (2).
  • Figure 3: Two example candidate parametric programs generated for symbolic regression.
  • Figure 4: FGGM definition for $\textcolor{funcpurple}{initialFGGM}$. The prompting function builds context-specific guidance by checking for lemmas, generics, sets, and multisets in the base program. The fallback returns the original program, satisfying $\Psi_{l}$ by the reflexivity axiom.
  • Figure 5: FGGM definition for $\textcolor{funcpurple}{diffErrorFGGM}$. Invoked when the previous iteration's output failed the diff checker. The prompt includes the diff error and previous attempt, instructing the model to preserve the base program logic.
  • ...and 5 more figures
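
As referenced in the Figure 1 caption, here is a schematic sketch of the search-verify-learn loop. All helpers are passed in as parameters and are hypothetical stand-ins for the components described above, not the paper's implementation.

```python
from typing import Callable, Sequence

def severa_loop(
    planner_propose: Callable,   # (1) Search: planner LLM proposes a parametric program with FGGM calls
    verify_program: Callable,    # (2) Verify: checks FGGM definitions, then proves (phi, psi) for all parameters
    tune_parameters: Callable,   # (3) Learn: gradient-based tuning (e.g., GRPO-style) of the GM parameters
    task_loss: Callable,         # soft objective L on the dataset
    execution_traces: Callable,  # traces on D used to seed the next search round
    task, libs, gms, phi, psi,
    dataset: Sequence,
    rounds: int = 3,
):
    """Schematic outer loop; helper names and signatures are illustrative only."""
    pool, feedback = [], None
    for _ in range(rounds):
        program = planner_propose(task, libs, gms, phi, psi, feedback)  # Search
        ok, error = verify_program(program, phi, psi)                   # Verify
        if not ok:
            feedback = error  # return the error message to the planner
            continue
        tuned = tune_parameters(program, dataset)  # Learn; correctness preserved by the FGGM guards
        pool.append(tuned)
        feedback = execution_traces(tuned, dataset)  # traces seed the next candidate
    # every agent in the pool is verified; return the one with the lowest task loss on D
    return min(pool, key=lambda agent: task_loss(agent, dataset))
```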

Theorems & Definitions (24)

  • Definition 1: LLMs
  • Definition 2
  • Lemma 1: Completeness of Checker
  • Proof: Proof Sketch
  • Theorem 1: Valid Local Contract
  • Proof: Proof Sketch
  • Theorem 2
  • Proof: Proof Sketch
  • Theorem 3: Soundness
  • Proof: Proof Sketch
  • ...and 14 more