AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement
J Rosser, Jakob Foerster
TL;DR
AgentBreeder is a multi-objective evolutionary framework for discovering and evaluating multi-agent scaffolds, scoring each scaffold $s$ on capability, $f_C(s)$, and safety, $f_S(s)$. Starting from seed scaffolds, a Meta Agent applies mutation and crossover, while descriptor-based clustering and Pareto-front-based elitism retain a diverse set of high-performing architectures. In the BlueTeam setting, discovered scaffolds achieve a 79.4% average uplift in safety benchmark performance while preserving or improving capability. In the RedTeam setting, the search surfaces adversarially weak scaffolds, exposing meaningful vulnerabilities. The work demonstrates that automated, safety-aware scaffold design and evaluation is practical as a pre-deployment safety tool for complex multi-agent systems.
Abstract
Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In "blue" mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In "red" mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them. Code is available at https://github.com/jrosseruk/AgentBreeder.
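The Pareto-front-based elitism mentioned in the TL;DR can be illustrated with a minimal sketch. It is not taken from the AgentBreeder codebase: the `Scaffold` type, the `dominates` and `pareto_front` helpers, the scaffold names, and all scores are hypothetical and chosen only for illustration. The sketch assumes the blue-mode setting in which both $f_C(s)$ and $f_S(s)$ are maximized.

```python
from typing import List, NamedTuple


class Scaffold(NamedTuple):
    """Hypothetical record for one candidate scaffold s."""
    name: str
    capability: float  # stands in for f_C(s); higher is better
    safety: float      # stands in for f_S(s); higher is better


def dominates(a: Scaffold, b: Scaffold) -> bool:
    """True if a is at least as good as b on both objectives
    and strictly better on at least one."""
    no_worse = a.capability >= b.capability and a.safety >= b.safety
    strictly_better = a.capability > b.capability or a.safety > b.safety
    return no_worse and strictly_better


def pareto_front(population: List[Scaffold]) -> List[Scaffold]:
    """Keep every scaffold that no other scaffold dominates."""
    return [
        s for s in population
        if not any(dominates(other, s) for other in population)
    ]


# Made-up scores, not results from the paper.
population = [
    Scaffold("cot", 0.60, 0.40),
    Scaffold("debate", 0.70, 0.35),
    Scaffold("self_refine", 0.55, 0.30),
    Scaffold("guarded_debate", 0.65, 0.80),
]

elites = pareto_front(population)
print([s.name for s in elites])
```

With these illustrative scores, `debate` and `guarded_debate` survive as elites because neither is beaten on both objectives by any other candidate. This quadratic-time filter is adequate for small populations; larger ones would typically use a sort-based method.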
