Inefficiencies of Meta Agents for Agent Design
Batu El, Mert Yuksekgonul, James Zou
TL;DR
This work investigates automating agent design through meta-agents that follow a sample–evaluate–iterate loop and identifies three core challenges: learning from prior designs, diversity and complementarity among generated agents, and economic viability. It compares three context-curation strategies (Cumulative, Parallel, and Evolutionary) across MGSM, DROP, MMLU, and GPQA, using GPT-3.5 for agents and GPT-4o for the meta-agent. The results show that cumulative context underperforms, while evolutionary context can yield gains of up to $+10\%$ on MGSM, albeit with reduced behavioral diversity; parallel context often provides better diversity and coverage but not always better performance. The cost analysis finds a break-even point around $n \approx 15{,}000$ examples for two datasets; for the others, the performance gains do not justify the design cost at any scale. Overall, the paper offers guidance on when automated agent design is advantageous and emphasizes the trade-offs between performance, diversity, and cost.
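To make the three context-curation strategies concrete, the following is a minimal sketch of how the meta-agent's context might be assembled at each iteration. It is an illustration under stated assumptions, not the paper's implementation: the function name `curate_context`, the `(name, score)` archive format, and the parameter `k` are hypothetical, and the reading of Parallel as "show no prior designs" and of Evolutionary as "show only the top-scoring prior designs" is inferred from the summary rather than taken from the paper's code.

```python
from typing import List, Tuple

# Hypothetical archive entry: (agent description, validation score).
Archive = List[Tuple[str, float]]


def curate_context(archive: Archive, strategy: str, k: int = 2) -> Archive:
    """Choose which previously designed agents the meta-agent sees next.

    Assumed semantics (not confirmed by the source):
      - "cumulative":   every prior design is kept in the context.
      - "parallel":     prior designs are ignored; each sample is independent.
      - "evolutionary": only the k highest-scoring prior designs are kept.
    """
    if strategy == "cumulative":
        return list(archive)
    if strategy == "parallel":
        return []
    if strategy == "evolutionary":
        # Sort by score, best first, and keep the top k as "parents".
        return sorted(archive, key=lambda entry: entry[1], reverse=True)[:k]
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up archive used only to exercise the function.
archive = [("chain-of-thought", 0.3), ("debate", 0.7), ("self-refine", 0.5)]
parents = curate_context(archive, "evolutionary", k=2)
```

Under these assumptions, the cumulative context grows with every iteration, while the evolutionary context stays bounded at `k` entries, which is one way to see why the two can behave differently as the loop runs longer.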
Abstract
Recent work has begun to automate the design of agentic systems using meta-agents that propose and iteratively refine new agent architectures. In this paper, we examine three key challenges in a common class of meta-agents. First, we investigate how a meta-agent learns across iterations and find that simply expanding the context with all previous agents, as proposed in prior work, performs worse than ignoring prior designs entirely. We show that performance improves with an evolutionary approach. Second, although the meta-agent designs multiple agents during training, it typically commits to a single agent at test time. We find that the designed agents have low behavioral diversity, limiting the potential for their complementary use. Third, we assess when automated design is economically viable. We find that for only two datasets is the overall cost of designing and deploying the agents lower than that of human-designed agents, and only when they are deployed on over 15,000 examples. For the other datasets, the performance gains do not justify the design cost, regardless of scale.
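The economic-viability question can be illustrated with a simple break-even calculation. The sketch below assumes one plausible cost model, cost per correct answer, in which the automated pipeline pays a one-time design cost plus a per-example inference cost. The paper's exact metric may differ, and every name and number here is hypothetical. Equating $(C_{\text{design}} + n\,c_a)/(n\,a_a)$ with $c_h/a_h$ and solving for $n$ gives $n^* = C_{\text{design}} / (a_a\,c_h/a_h - c_a)$, which is finite only when the denominator is positive.

```python
import math
from typing import Optional


def break_even_examples(
    design_cost: float,
    auto_cost_per_example: float,
    auto_accuracy: float,
    human_cost_per_example: float,
    human_accuracy: float,
) -> Optional[int]:
    """Smallest n at which the automated agent's cost per correct answer
    matches the human-designed agent's, under the assumed cost model.

    Returns None when the automated agent never breaks even, i.e. when its
    accuracy gain does not offset its per-example cost at any scale.
    """
    human_cost_per_correct = human_cost_per_example / human_accuracy
    # Savings per deployed example, measured against the human-designed baseline.
    margin = auto_accuracy * human_cost_per_correct - auto_cost_per_example
    if margin <= 0:
        return None
    return math.ceil(design_cost / margin)


# Illustrative, made-up inputs (not figures from the paper).
n_star = break_even_examples(
    design_cost=10.0,
    auto_cost_per_example=0.001,
    auto_accuracy=0.75,
    human_cost_per_example=0.001,
    human_accuracy=0.5,
)
```

The `None` branch corresponds to the case described above in which the performance gains do not justify the design cost regardless of scale: if the margin is zero or negative, no deployment size recovers the one-time design cost.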
