Inefficiencies of Meta Agents for Agent Design

Batu El, Mert Yuksekgonul, James Zou

TL;DR

This work investigates automating agent design through meta-agents that follow a sample–evaluate–iterate loop and identifies three core challenges: learning from prior designs, diversity and complementarity among generated agents, and economic viability. It compares three context-curation strategies (Cumulative, Parallel, and Evolutionary) across MGSM, DROP, MMLU, and GPQA, using GPT-3.5 for agents and GPT-4o for the meta-agent. Cumulative context underperforms, while evolutionary context yields gains of up to $+10\%$ on MGSM at the price of reduced behavioral diversity; parallel context often provides better diversity and coverage, but not always better performance. The cost analysis finds a break-even point around $n \approx 15{,}000$ examples for two datasets; for the others, the performance gains do not justify the design cost at any scale. Overall, the paper offers guidance on when automated agent design is advantageous and emphasizes the trade-offs between performance, diversity, and economics in this meta-design paradigm.
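
The sample–evaluate–iterate loop and the three context-curation strategies can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the excerpt does not give the exact definitions, so the function names, the `(agent, score)` representation, and in particular the top-`k` selection rule used for the evolutionary strategy are assumptions. The `propose` and `evaluate` callables stand in for the meta-agent call and the training-set evaluation.

```python
from typing import Callable, List, Tuple

Agent = str                      # hypothetical: an agent is represented by its source text
Scored = Tuple[Agent, float]     # (agent, training-set score)
Curator = Callable[[List[Scored], List[Scored]], List[Scored]]


def cumulative_context(initial: List[Scored], history: List[Scored]) -> List[Scored]:
    """Cumulative: the initial library plus every agent designed so far."""
    return initial + history


def parallel_context(initial: List[Scored], history: List[Scored]) -> List[Scored]:
    """Parallel (assumed form): ignore prior designs, show only the initial library."""
    return list(initial)


def evolutionary_context(initial: List[Scored], history: List[Scored], k: int = 3) -> List[Scored]:
    """Evolutionary (assumed form): keep only the k highest-scoring agents seen so far."""
    pool = initial + history
    return sorted(pool, key=lambda pair: pair[1], reverse=True)[:k]


def design_loop(
    propose: Callable[[List[Scored]], Agent],
    evaluate: Callable[[Agent], float],
    curate: Curator,
    initial: List[Scored],
    iterations: int,
) -> List[Scored]:
    """Sample an agent from the curated context, evaluate it, and repeat."""
    history: List[Scored] = []
    for _ in range(iterations):
        context = curate(initial, history)       # what the meta-agent gets to see
        agent = propose(context)                 # sample a new design
        history.append((agent, evaluate(agent))) # score it on the training set
    return history
```

The only difference between the three strategies in this sketch is the `curate` argument, which makes it easy to swap one for another while holding the rest of the loop fixed.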

Abstract

Recent work has begun to automate the design of agentic systems using meta-agents that propose and iteratively refine new agent architectures. In this paper, we examine three key challenges in a common class of meta-agents. First, we investigate how a meta-agent learns across iterations and find that simply expanding the context with all previous agents, as proposed by previous work, performs worse than ignoring prior designs entirely. We show that performance improves with an evolutionary approach. Second, although the meta-agent designs multiple agents during training, it typically commits to a single agent at test time. We find that the designed agents have low behavioral diversity, limiting the potential for their complementary use. Third, we assess when automated design is economically viable. We find that in only two datasets is the overall cost of designing and deploying the agents lower than that of human-designed agents, and only when they are deployed on over 15,000 examples. In contrast, the performance gains for the other datasets do not justify the design cost, regardless of scale.
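
One way to read the economic-viability question is as a break-even calculation: an automatically designed agent carries a one-time design cost, which pays off only if it is cheaper per unit of useful work than the human-designed baseline. The sketch below assumes a simple linear cost model (fixed cost plus a constant cost per solved example); the function name and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Optional


def break_even_units(
    fixed_cost: float,
    unit_cost_designed: float,
    unit_cost_baseline: float,
) -> Optional[float]:
    """Number of units at which the designed agent's total cost matches the baseline's.

    Assumed cost model (not from the paper):
        designed: fixed_cost + unit_cost_designed * n
        baseline: unit_cost_baseline * n
    Returns None when the designed agent is not cheaper per unit, in which
    case the design cost is never recovered at any scale.
    """
    saving_per_unit = unit_cost_baseline - unit_cost_designed
    if saving_per_unit <= 0:
        return None
    return fixed_cost / saving_per_unit


# Made-up numbers for illustration only.
print(break_even_units(300.0, 0.0625, 0.125))  # design cost recovered after this many units
print(break_even_units(300.0, 0.2, 0.1))       # None: never recovered
```

The `None` branch corresponds to the case described in the abstract where gains do not justify the design cost regardless of scale.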

Paper Structure

This paper contains 29 sections, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of the meta-agent framework. The Meta-Agent iteratively samples and evaluates agents, refining its outputs through a feedback loop. We focus on three key dimensions: (1) learning from previously designed agents; (2) diversity and complementarity of generated agents; and (3) economic viability.
  • Figure 2: Agent Diversity: Cumulative context curation yields lower overall similarity. Parallel context curation produces greater agent diversity than evolutionary curation, highlighting an exploration-exploitation trade-off. Histograms of agent similarities (row averages of $\mathbf{C}$), excluding agents with zero performance (all-black rows of $\mathbf{S}$ in Figure \ref{fig:S}, and the corresponding dark blue rows and columns of $\mathbf{C}$ in Figure \ref{fig:SST}). Each subplot shows the averaged similarity score of each agent (x-axis) against its frequency (y-axis) across $3$ runs.
  • Figure 3: Average inference cost per test query: C > E > P > I. Shown for agents in the initial library $F$ (Initial, I; see Appendix \ref{apdx:initial-agents}) and for agents designed by the meta-agent with $\phi_C$ (Cumulative, C), $\phi_P$ (Parallel, P), and $\phi_E$ (Evolutionary, E). Averaged across all agents from 3 runs.
  • Figure 4: Cost Efficiency: The highest-performing agent from the initial library produces outputs with the same total performance at lower cost. Number of questions solved (solid lines) and attempted (dashed lines) versus cost spent, for the agents with the best training-set performance. The x-intercept indicates the fixed cost $C_0$ ($0$ for agents in the initial library); the slope beyond it reflects the variable cost per attempt or per solution.
  • Figure 5: Average inference cost per test query of the best agents. Shown for the best agent in the initial library $F$ (Initial; see Appendix \ref{apdx:initial-agents}) and for the best agent designed by the meta-agent with $\phi_C$ (Cumulative), $\phi_P$ (Parallel), and $\phi_E$ (Evolutionary). Averaged across the single best agents from 3 runs. The best agent is the one with the highest training performance.
  • ...and 7 more figures
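
The per-agent similarity scores mentioned in the Figure 2 caption (row averages of $\mathbf{C}$, with zero-performance agents excluded) could be computed along the following lines. This is a sketch under stated assumptions: the excerpt does not define $\mathbf{S}$ or $\mathbf{C}$, so it assumes $\mathbf{S}$ is a binary agents-by-questions success matrix and $\mathbf{C}$ is the cosine similarity between its rows, with the diagonal included in the row average. The function name is hypothetical.

```python
from math import sqrt
from typing import List


def agent_similarities(S: List[List[int]]) -> List[float]:
    """Row averages of a cosine-similarity matrix C built from success matrix S.

    Assumptions (not confirmed by the source): S[i][q] is 1 if agent i solved
    question q and 0 otherwise; C[i][j] is the cosine similarity of rows i and j.
    Agents that solved nothing (all-zero rows) are dropped first.
    """
    rows = [row for row in S if any(row)]            # exclude zero-performance agents
    norms = [sqrt(sum(x * x for x in row)) for row in rows]
    averages = []
    for i, row_i in enumerate(rows):
        sims = [
            sum(a * b for a, b in zip(row_i, row_j)) / (norms[i] * norms[j])
            for j, row_j in enumerate(rows)
        ]
        averages.append(sum(sims) / len(sims))       # row average, diagonal included
    return averages


# Toy example: two identical agents, one complementary agent, one that solves nothing.
S = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(agent_similarities(S))
```

Under these assumptions, a low average for an agent means it solves questions the others miss, which is the kind of complementarity the diversity analysis is looking for.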