A Multi-Agent Framework for Stateful Inference-Time Search

Arshika Lalan, Rajat Ghosh, Aditya Kolsur, Debojyoti Dutta

TL;DR

This work presents a training-free, stateful multi-agent evolutionary framework for inference-time unit test generation. By coupling a persistent inference-time state with adversarial mutation and evolutionary preservation, the Actor–Adversary–Critic loop guided by a non-Markovian Controller enhances edge-case discovery and code coverage beyond stateless baselines. Evaluation on HumanEval and TestGenEvalMini across multiple LLM families demonstrates improved coverage, robust edge-case generation, and scalable reasoning for unseen codebases. The approach enables deeper, more reliable reasoning in code-related tasks without parametric model fine-tuning, albeit at higher compute cost and with opportunities for branch-aware enhancements.

Abstract

Recent work explores agentic inference-time techniques for structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks because it lacks persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieves surface-level code generation but remains brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation, specifically in the generation of edge cases. Robust edge cases are produced by an evolutionary search process in which specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation promotes diversity and exploration across the space of possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show that our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini with three diverse LLM families: Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.
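The propose, mutate, and score loop with a persistent state can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the toy `target` function, the integer candidates, the `State` fields, and the rule-based `actor`, `adversary`, and `critic` functions stand in for what the paper describes as LLM-backed agents.

```python
import random
from dataclasses import dataclass, field


def target(x: int) -> str:
    """Hypothetical function under test with four distinct behaviours."""
    if x < 0:
        return "negative"
    if x == 0:
        return "zero"
    if x > 1000:
        return "large"
    return "small"


@dataclass
class State:
    """Persistent state carried across generations (illustrative fields)."""
    covered: set = field(default_factory=set)    # behaviours observed so far
    elites: list = field(default_factory=list)   # preserved edge cases
    history: list = field(default_factory=list)  # coverage size per generation


def actor(state: State, rng: random.Random, k: int = 4) -> list:
    # Propose candidates near preserved elites (or near 1 on the first pass).
    seeds = state.elites or [1]
    return [rng.choice(seeds) + rng.randint(-3, 3) for _ in range(k)]


def adversary(candidates: list) -> list:
    # Mutate each proposal toward sign flips, larger magnitudes, and neighbours.
    mutated = []
    for c in candidates:
        mutated += [-c, c * 10, c - 1]
    return mutated


def executor(candidate: int) -> str:
    # Run the code under test and report the observed behaviour.
    return target(candidate)


def critic(behaviour: str, state: State) -> float:
    # Reward only behaviours that the persistent state has not seen yet.
    return 1.0 if behaviour not in state.covered else 0.0


def controller(generations: int = 5, seed: int = 0) -> State:
    rng = random.Random(seed)
    state = State()
    for _ in range(generations):
        proposals = actor(state, rng)
        for cand in proposals + adversary(proposals):
            behaviour = executor(cand)
            if critic(behaviour, state) > 0:
                state.covered.add(behaviour)
                state.elites.append(cand)  # preservation: elites are never dropped
        state.history.append(len(state.covered))
    return state


if __name__ == "__main__":
    final = controller()
    print(final.history, sorted(final.covered))
```

Because the critic scores each candidate against the accumulated `covered` set rather than only the current generation, the decision at each step depends on the whole search history. That is the sense in which this sketch is stateful; a stateless baseline would score each candidate in isolation.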

Paper Structure

This paper contains 37 sections, 19 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our architecture for unit test generation decomposes the task into two phases: (i) edge case generation from source code and (ii) unit test construction from those cases. The first phase demands deeper reasoning and is addressed through an evolutionary search (as highlighted in the blue box) executed in a stateful manner over multiple stages ($N\times$) by four agents—Actor, Executor, Adversary, and Critic—coordinated by a Controller that propagates persistent state across $N$ evolutionary stages (as highlighted by the magenta line). Once the edge cases converge to sufficient coverage and robustness, they are translated into a complete unit test file via a single-step inference call.
  • Figure 2: Final edge case quality on TestGenEvalMini measured in terms of line, branch, and function coverages across three model families: Gemma-2-27B (top-left), GPT-o4-mini (top-right), and Llama-70B (bottom). The proposed inference-time evolutionary search (SUT) consistently achieves strong coverage, outperforming few-shot and chain-of-thought baselines in most settings.
  • Figure 3: Resolution rate (blue, left axis) and average runtime (red, right axis) per iteration for HumanEval (left) and TestGenEvalMini (right). Higher resolution rate indicates a larger fraction of problems that converge to valid unit tests. Average runtime grows with iteration count as the stateful multi-agent search explores deeper inference-time trajectories.
  • Figure 4: MCP Architecture overview
  • Figure 5: Simplified excerpt from Django ORM internals
  • ...and 1 more figure
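The captions above report line, branch, and function coverage. As a rough illustration of what line coverage measures, the following sketch records which lines of a function execute for a given set of inputs, using Python's standard `sys.settrace` hook. The helper `line_coverage` and the toy `sign` function are hypothetical names introduced here; the paper's actual coverage tooling is not specified in this excerpt.

```python
import sys


def sign(x):
    """Toy function under test (hypothetical)."""
    if x < 0:
        return "neg"
    return "non-neg"


def line_coverage(func, inputs):
    """Return the executed line offsets of func, relative to its def line."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for x in inputs:
            func(x)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return hit


if __name__ == "__main__":
    print(sorted(line_coverage(sign, [5])))
    print(sorted(line_coverage(sign, [-1, 5])))
```

Adding the negative input exercises the line that the single positive input never reaches, which is the kind of gain an additional edge case is meant to produce.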

Theorems & Definitions (6)

  • Definition 1: State
  • Definition 2: Actor
  • Definition 3: Adversary
  • Definition 4: Critic
  • Definition 5: Executor
  • Definition 6: Controller