Evolutionary Generation of Multi-Agent Systems

Yuntong Hu, Matthew Trager, Yuting Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

TL;DR

EvoMAS reframes multi-agent system design as structured configuration generation and evolves MAS configurations via feedback-guided mutation, crossover, and memory reuse. By maintaining an experience memory and a pool seeded with human-designed MAS, EvoMAS discovers task-adaptive architectures that balance execution reliability with performance, outperforming both hand-crafted baselines and prior automatic MAS generation methods. Across reasoning, coding, and tool-use benchmarks, EvoMAS achieves higher task accuracy and near-universal executability, while demonstrating scalable compute efficiency and transferability of evolved designs. The work introduces a principled, configuration-based evolutionary optimization paradigm that enables robust, generalizable MAS design for diverse real-world tasks.

Abstract

Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard.
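The loop described in the abstract (select from a pool, apply feedback-conditioned mutation and crossover, update the pool and an experience memory) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` record, the `evolve` signature, the top-k parent selection, and the toy `toy_evaluate`, `toy_mutate`, and `toy_crossover` functions are all assumptions made for the example. In EvoMAS the operators would be LLM-driven and conditioned on real execution traces; here a one-field dictionary stands in for a configuration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    config: dict        # structured MAS configuration (hypothetical schema)
    reward: float = 0.0
    trace: str = ""     # execution feedback used to condition the operators


def evolve(seed_configs, evaluate, mutate, crossover, generations=3, k=2):
    """Toy evolutionary loop over configurations (assumed structure)."""
    memory = []  # experience memory: (config, reward, trace) records
    pool = []
    for cfg in seed_configs:
        reward, trace = evaluate(cfg)
        pool.append(Candidate(cfg, reward, trace))
        memory.append((cfg, reward, trace))

    for _ in range(generations):
        # Select the k best candidates as parents (an assumed selection rule).
        parents = sorted(pool, key=lambda c: c.reward, reverse=True)[:k]
        # Feedback-conditioned mutation: each operator sees the parent's trace.
        children = [mutate(p.config, p.trace, memory) for p in parents]
        if len(parents) >= 2:
            children.append(crossover(parents[0].config, parents[1].config, memory))
        # Evaluate children, then refine both the pool and the memory.
        for cfg in children:
            reward, trace = evaluate(cfg)
            pool.append(Candidate(cfg, reward, trace))
            memory.append((cfg, reward, trace))

    return max(pool, key=lambda c: c.reward), memory


# --- Toy stand-ins (hypothetical): the "ideal" system has three agents. ---
def toy_evaluate(config):
    gap = config["n_agents"] - 3
    trace = "ok" if gap == 0 else ("too many agents" if gap > 0 else "too few agents")
    return float(-abs(gap)), trace


def toy_mutate(config, trace, memory):
    # The toy operator ignores the memory and reacts only to the trace.
    step = {"too few agents": 1, "too many agents": -1}.get(trace, 0)
    return {"n_agents": config["n_agents"] + step}


def toy_crossover(a, b, memory):
    return {"n_agents": (a["n_agents"] + b["n_agents"]) // 2}


best, memory = evolve(
    [{"n_agents": 1}, {"n_agents": 6}],
    toy_evaluate, toy_mutate, toy_crossover, generations=2,
)
print(best.config, best.reward, len(memory))
```

Keeping the evaluator and operators as plain callables makes the loop independent of how a configuration is represented, which is the main point of working in configuration space rather than generating code.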

Paper Structure

This paper contains 110 sections, 6 equations, 13 figures, and 9 tables.

Figures (13)

  • Figure 1: Overview of EvoMAS. Given a task, the MAS generator produces structured configurations specifying agent roles, model assignments, prompts, and communication topology. The MAS executor instantiates agents accordingly and executes the task. A verifier evaluates outputs to compute reward signals, which guide evolutionary optimization through mutation and selection over multiple generations.
  • Figure 2: Trade-off between execution rate and task performance for MAS generation methods. Each point represents a method, with execution rate (%) on the x-axis and task performance (%) on the y-axis. EvoMAS achieves both high execution reliability and superior task performance across all benchmarks.
  • Figure 3: Results with a state-of-the-art LLM (Claude-4.5-Sonnet). We compare Direct LLM Call, Single Agent, Majority Vote, and EvoMAS using Claude-4.5-Sonnet as both the MAS generator and agent backbone. EvoMAS demonstrates strong performance with the latest frontier model, achieving particularly notable results on SWE-Bench-Verified. Additional results with Claude-3.5-Sonnet and Claude-4-Sonnet are provided in Figure 5.
  • Figure 4: Scaling ability on BBEH-Mini. Each line starts at its natural operating cost (circled). EvoMAS ($\star$) outperforms all baselines across budgets and continues to improve with additional compute, while other methods plateau or degrade.
  • Figure 5: Results with earlier Claude Sonnet models. We compare Direct LLM Call, Single Agent, Majority Vote, and EvoMAS using (a) Claude-3.5-Sonnet and (b) Claude-4-Sonnet as both the MAS generator and agent backbone. EvoMAS consistently outperforms baselines across both model generations. Performance scales with model capability, with Claude-4-Sonnet achieving substantially higher absolute performance than Claude-3.5-Sonnet across all methods.
  • ...and 8 more figures
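The two axes of Figure 2 can be illustrated with a small calculation. The sketch below assumes that execution rate is the share of generated systems that run to completion and that task performance is the share of all tasks solved, so a system that fails to execute counts as unsolved. The excerpt does not spell out these definitions, so both are assumptions, as are the `summarize` function and its input format.

```python
def summarize(runs):
    """Compute execution rate and task performance, both in percent.

    `runs` is a list of (executed, correct) boolean pairs, one per task.
    A run that did not execute is treated as incorrect (an assumption).
    """
    if not runs:
        raise ValueError("runs must not be empty")
    total = len(runs)
    executed = sum(1 for ran, _ in runs if ran)
    solved = sum(1 for ran, correct in runs if ran and correct)
    return {
        "execution_rate": 100.0 * executed / total,
        "task_performance": 100.0 * solved / total,
    }


stats = summarize([(True, True), (True, False), (False, False), (True, True)])
print(stats)
```

Under this convention task performance can never exceed execution rate, which is one way to read the trade-off the figure plots.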

Theorems & Definitions (1)

  • Definition 2.1: Multi-Agent System Configuration
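The body of Definition 2.1 is not reproduced here, but the Figure 1 caption lists what a configuration specifies: agent roles, model assignments, prompts, and communication topology. A minimal sketch of such a structure is shown below. The class names `AgentSpec` and `MASConfig`, the field names, the edge-list encoding of the topology, and the placeholder model names are all hypothetical and may differ from the paper's formal definition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentSpec:
    role: str     # e.g. "planner" or "coder"
    model: str    # model assignment (placeholder names below)
    prompt: str   # system prompt for this agent


@dataclass
class MASConfig:
    agents: dict[str, AgentSpec]
    # Communication topology as directed edges (sender, receiver).
    edges: list[tuple[str, str]] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the config is well-formed."""
        errors = []
        if not self.agents:
            errors.append("no agents defined")
        for src, dst in self.edges:
            for name in (src, dst):
                if name not in self.agents:
                    errors.append(f"edge ({src} -> {dst}) references unknown agent '{name}'")
        return errors


config = MASConfig(
    agents={
        "planner": AgentSpec("planner", "model-a", "Break the task into steps."),
        "solver": AgentSpec("solver", "model-b", "Carry out each step."),
    },
    edges=[("planner", "solver")],
)
broken = MASConfig(agents=dict(config.agents), edges=[("planner", "critic")])
print(config.validate(), broken.validate())
```

A declarative structure like this can be checked before anything is executed, which is one plausible reason configuration generation would fail less often at runtime than free-form code generation.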