AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement

J Rosser, Jakob Foerster

TL;DR

AgentBreeder presents a multi-objective evolutionary framework for discovering and evaluating multi-agent scaffolds, balancing a capability objective $f_C(s)$ against a safety objective $f_S(s)$. It employs seed scaffolds, mutation and crossover via a Meta Agent, descriptor-based clustering, and Pareto-front-based elitism to generate diverse high-performing architectures. Experiments across BlueTeam and RedTeam paradigms show substantial safety improvements while preserving or enhancing capability, and they reveal that red-team scaffolds can uncover meaningful adversarial vulnerabilities. The work demonstrates the practicality of automated, safety-aware scaffold design and evaluation as a pre-deployment safety tool for complex multi-agent systems.

Abstract

Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In "blue" mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In "red" mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them. Code is available at https://github.com/jrosseruk/AgentBreeder.

Paper Structure

This paper contains 28 sections, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: A high-level illustration of the AgentBreeder algorithm as outlined in Algorithm 1. Starting from seed scaffolds $Q_0$, at each generation $g$ the newly generated scaffolds $Q_{g-1}$ are evaluated on capability ($f_C(s)$) and/or safety ($f_S(s)$) benchmarks, then embedded via $f_D(s)$ for clustering $A(\cdot)$ into $K$ clusters. Within each cluster, Pareto fronts $\{F_1, \ldots, F_K\}$ are identified according to $f_C(s)$ and/or $f_S(s)$, and these "frontier" solutions become the elite set $E_g$. An LLM-based Meta Agent applies crossover and mutation to the elites, creating new offspring scaffolds $Q_g$. These offspring are added to the population for the next generation $P_{g+1}$. By repeating this process for $G$ generations, AgentBreeder explores a large, diverse set of multi-agent systems while balancing capability and safety. AgentBreeder can be run in three different modes, and the right-hand side of this figure shows the optimal direction of travel of the Pareto front for each mode. BlueAgentBreeder is a defense mode that seeks to maximize both capability and safety, whereas RedAgentBreeder is an attack mode that minimizes safety. CapableAgentBreeder serves as our baseline, optimizing only for capability without regard to safety.
  • Figure 2: BlueAgentBreeder evolves scaffolds that improve the capability-safety Pareto frontier. The plots show the evolution of multi-agent scaffolds across 20 generations on the validation set of three different benchmarks: GPQA (left), MMLU (middle), and DROP (right). Each point represents a scaffold, with colors indicating generation (lighter blue for seed scaffolds, darker blue for later generations). The x-axis measures capability ($f_C$) and the y-axis measures safety ($f_S$). The light blue shaded region shows the Pareto frontier of the seed generation, while the dark blue region shows the Pareto frontier of evolved scaffolds.
  • Figure 3: RedAgentBreeder evolves scaffolds over 10 generations. Results are reported on the validation set. The x-axis shows capability ($f_C$) and the y-axis shows inverted safety ($1 - f_S$), so higher values on both axes correspond to the optimization objectives. Lighter red points indicate seed scaffolds (Gen 0), while darker red points represent scaffolds from later generations (up to Gen 10). The shaded regions highlight the Pareto regions of the seed scaffolds (lighter area) and the evolved scaffolds (darker area), with the connected points along each boundary representing non-dominated solutions.
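
The per-cluster elite selection described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the mode strings, and the dictionary layout are assumptions made for illustration, and cluster labels are taken as given rather than computed from the descriptor $f_D(s)$. The red-mode objective pair ($f_C$, $1 - f_S$) follows the axes of Figure 3.

```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, ...]


def objectives(f_c: float, f_s: float, mode: str) -> Objectives:
    """Map raw scores to a to-be-maximized objective vector (hypothetical helper)."""
    if mode == "blue":      # defense: maximize capability and safety
        return (f_c, f_s)
    if mode == "red":       # attack: maximize capability and inverted safety
        return (f_c, 1.0 - f_s)
    if mode == "capable":   # baseline: capability only
        return (f_c,)
    raise ValueError(f"unknown mode: {mode}")


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Dict[str, Objectives]) -> List[str]:
    """Return the names of all non-dominated points, in insertion order."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


def select_elites(
    scores: Dict[str, Tuple[float, float]],  # scaffold name -> (f_C, f_S)
    clusters: Dict[str, int],                # scaffold name -> cluster label
    mode: str,
) -> List[str]:
    """Take the Pareto front inside each cluster and pool the results as elites."""
    elites: List[str] = []
    for k in sorted(set(clusters.values())):
        members = {
            name: objectives(*scores[name], mode)
            for name in scores
            if clusters[name] == k
        }
        elites.extend(pareto_front(members))
    return elites
```

Selecting the front within each cluster, rather than over the whole population, lets a scaffold that is dominated globally survive as long as it is non-dominated among scaffolds with similar descriptors, which is how the clustering step helps preserve diversity.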