Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan

TL;DR

The paper tackles the fragmented and costly evaluation of AI agents across real-world domains. It introduces the Holistic Agent Leaderboard (HAL) harness, a unified, scalable framework for running and logging agent evaluations on many benchmarks and models, with automated log analysis to detect bugs and unsafe behaviors. Key findings include that higher reasoning effort often does not improve accuracy, that agent scaffolds profoundly affect cost and performance, and that log-based analysis reveals issues like shortcuts and data leakage that pure accuracy metrics miss. HAL enables reproducible, cost-aware, cross-domain benchmarking and provides a data-rich resource (2.5B tokens) to study agent behavior, with an aim to shift focus toward reliable real-world performance.

Abstract

AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

Paper Structure

This paper contains 71 sections, 48 figures, 23 tables.

Figures (48)

  • Figure 1: Challenges in evaluating AI agents and how HAL addresses them.
  • Figure 2: Pareto frontier of accuracy and cost (dotted red line). The Pareto frontier captures the models with the best accuracy at a given budget. The three models most commonly on the frontier are Gemini 2.0 Flash (7 of 9 benchmarks), GPT-5 (4 of 9), and o4-mini Low (4 of 9). The model least frequently on the frontier is DeepSeek R1 (0 of 9), followed by Claude-3.7 Sonnet High (1 of 9) and Claude Opus 4.1 and Claude Opus 4.1 High (1 of 8; note that we did not run Opus 4.1 on Online Mind2Web due to budget limits, as we estimated it would cost about $20,000; on this benchmark, we evaluated Sonnet 4 instead). See Figure \ref{fig:accuracy_tokens} in the Appendix for the corresponding Pareto frontier using token counts rather than dollar costs. We plot the convex hull because one can interpolate between agents by randomly selecting between them (e.g., using agent A 30% of the time and agent B 70% of the time to achieve intermediate cost-accuracy points). We include the origin (0,0) since one can always choose not to deploy an agent, achieving zero accuracy at zero cost. Note the non-standard y axes.
  • Figure 3: Effect of higher reasoning on accuracy. We test four model pairs: Sonnet 3.7, Sonnet 4, and Opus 4.1 (no reasoning vs. high) and o4-mini (low vs. high), each with a given scaffold and benchmark. For 21 of 36 runs, higher reasoning effort does not improve accuracy.
  • Figure 4: Results from Docent rubric analysis of 1,634 transcripts from 36 model-scaffold pairs on AssistantBench (AB), SciCode, and CORE-Bench (CORE). See Table \ref{tab:failure-prevalence} and Table \ref{tab:reliability-rr} for more detailed results, and Table \ref{tab:rubrics_abridged} for rubrics and screenshots of example behaviors.
  • Figure A1: HAL harness architecture. The harness coordinates agent evaluation by (1) accepting benchmark tasks and agent implementations as input, (2) provisioning isolated execution environments (local, Docker, or Azure VMs), (3) capturing all agent interactions through Weave logging, (4) evaluating agent outputs against gold answers, and (5) aggregating results for the public leaderboard. The system automatically handles file transfer and resource cleanup across all execution backends.
  • ...and 43 more figures
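
The Figure 2 caption describes a convex-hull Pareto frontier that includes the origin and allows interpolation between agents by random selection. A minimal Python sketch of that idea follows. It is not taken from the HAL codebase: the function names (`convex_pareto_frontier`, `mix_agents`) and the example (cost, accuracy) values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, accuracy); illustrative representation


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_pareto_frontier(points: List[Point]) -> List[Point]:
    """Upper convex hull of (cost, accuracy) points, starting at the origin.

    The origin is added because not deploying any agent gives zero
    accuracy at zero cost. The hull is cut off at the highest-accuracy
    point, since paying more for less accuracy is never on the frontier.
    """
    pts = sorted(set(points) | {(0.0, 0.0)})
    hull: List[Point] = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # from its predecessor to p (keeps the hull concave from above).
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    best = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: best + 1]


def mix_agents(a: Point, b: Point, share_a: float) -> Point:
    """Expected (cost, accuracy) when agent a is used with probability share_a."""
    share_b = 1.0 - share_a
    return (share_a * a[0] + share_b * b[0], share_a * a[1] + share_b * b[1])


# Hypothetical agents as (cost in dollars, accuracy); not results from the paper.
agents = [(1.0, 0.30), (2.0, 0.35), (3.0, 0.20), (4.0, 0.60), (5.0, 0.55)]
frontier = convex_pareto_frontier(agents)

# Using the first agent 30% of the time and the fourth 70% of the time
# gives an intermediate point on the segment between them.
blended = mix_agents(agents[0], agents[3], 0.3)
```

In this sketch an agent that is not dominated by any single agent can still fall below the hull, as the hypothetical (2.0, 0.35) agent does here, because a random mixture of two other agents reaches better expected accuracy at the same expected cost.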