Efficient Benchmarking of AI Agents

Franck Ndzomga

Abstract

Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44-70% while maintaining high rank fidelity under scaffold and temporal shifts. It provides more reliable rankings than random sampling, which exhibits high variance across seeds, and outperforms greedy task selection under distribution shift. These results suggest that reliable leaderboard ranking does not require full-benchmark evaluation.
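The mid-range filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`mid_range_tasks`, `subset_scores`, `spearman_rho`), the binary pass/fail matrices, and the toy data are all hypothetical, and the band is treated as inclusive at both ends. The sketch selects tasks from historical agents' pass rates, then checks whether new agents rank the same on the subset as on the full benchmark.

```python
from statistics import mean


def mid_range_tasks(pass_matrix, low=0.30, high=0.70):
    """Indices of tasks whose historical pass rate lies in [low, high].

    pass_matrix[a][t] is 1 if historical agent a solved task t, else 0
    (an assumed data layout, chosen for illustration).
    """
    n_tasks = len(pass_matrix[0])
    rates = [mean(row[t] for row in pass_matrix) for t in range(n_tasks)]
    return [t for t, rate in enumerate(rates) if low <= rate <= high]


def subset_scores(pass_matrix, tasks):
    """Per-agent pass rate restricted to the given task indices."""
    return [mean(row[t] for t in tasks) for row in pass_matrix]


def average_ranks(values):
    """Ranks starting at 1 for the smallest value; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy historical results: 5 agents (rows) x 8 tasks (columns). Hypothetical data.
history = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

# Toy results for 3 new agents on the same 8 tasks. Hypothetical data.
new_agents = [
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
]

selected = mid_range_tasks(history)
all_tasks = list(range(len(history[0])))
rho = spearman_rho(
    subset_scores(new_agents, all_tasks),
    subset_scores(new_agents, selected),
)
reduction = 1 - len(selected) / len(all_tasks)

print("selected tasks:", selected)
print("task reduction:", reduction)
print("Spearman rho (full vs. subset):", rho)
```

On this toy data the filter keeps four of the eight tasks and the new agents rank identically on the subset and on the full set. Real leaderboards will have ties and noise, so the correlation would generally fall below 1.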

Paper Structure (35 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: The Robustness Gap. Spearman $\rho$ (ranking) and $R^2$ (score prediction) across evaluation regimes. Ranking fidelity remains robust even as absolute score prediction degrades rapidly under temporal, scaffold, and random shifts.
  • Figure 2: Range of Performance by Selection Strategy. Mean, best-case, and worst-case Spearman $\rho$ across overlapping benchmarks and evaluation protocols. Mid-Range task selection offers the overall best ranking preservation across benchmarks.
  • Figure 3: Performance Stability Across Distribution Shifts. Average Spearman $\rho$ by task selection strategy, disaggregated across four evaluation protocols (LOAO, LOSO, Random 20%, and Temporal shift). Mid-Range selection maintains reliable rankings ($>0.85$ mean $\rho$) regardless of the evaluation regime.
  • Figure 4: The performance gap between Mid-Range (MR) and Easiest-$k$ selection correlates with task-set overlap ($r = -0.71$, $p = 0.048$): Easiest-$k$ is competitive when it selects the same tasks as MR. USACO (14% overlap) is the benchmark where the two sets genuinely differ, and it is where MR most clearly wins ($\Delta\rho = 0.078$ under distribution shift).
  • Figure 5: Post-hoc sensitivity analysis of mid-range bands. The average Spearman rank correlation $\rho$ (blue, left axis) and average task reduction percentage (red, right axis) across seven benchmarks for increasingly narrow mid-range bands. Shaded regions denote $\pm 1$ standard deviation across the benchmarks. The 30--70 band, chosen a priori, strikes a highly favorable balance.
  • ...and 2 more figures