Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

Peter Mühlbacher, Nikos I. Bosse, Lawrence Phillips

TL;DR

This work presents a realistic, long-term benchmark concept for evaluating open-web research agents on economically impactful, messy tasks. It compares multiple LLMs across four architectures (non-planning vs planning, with/without delegation) using a suite of real-world tasks spanning geopolitics, finance, epidemiology, and forecasting, aided by tools like Google search and a Python REPL. Key findings show Claude-3.5 Sonnet and o1-preview achieving the best average performance, with planning and delegation providing the strongest results in many settings, while weaker models lag notably on complex, open-web reasoning. The study provides both quantitative scores and qualitative traces to illuminate failure modes and to guide future development of agent benchmarks that better reflect real-world impact and continuity across frontier models.

Abstract

We present initial results of a forthcoming benchmark for evaluating LLM agents on white-collar tasks of economic value. We evaluate agents on real-world "messy" open-web research tasks of the type that are routine in finance and consulting. In doing so, we lay the groundwork for an LLM agent evaluation suite where good performance directly corresponds to a large economic and societal impact. We built and tested several agent architectures with o1-preview, GPT-4o, Claude-3.5 Sonnet, Llama 3.1 (405b), and GPT-4o-mini. On average, LLM agents powered by Claude-3.5 Sonnet and o1-preview substantially outperformed agents using GPT-4o, with agents based on Llama 3.1 (405b) and GPT-4o-mini lagging noticeably behind. Across LLMs, a ReAct architecture with the ability to delegate subtasks to subagents performed best. In addition to quantitative evaluations, we qualitatively assessed the performance of the LLM agents by inspecting their traces and reflecting on their observations. Our evaluation represents the first in-depth assessment of agents' abilities to conduct challenging, economically valuable analyst-style research on the real open web.

Paper Structure

This paper contains 97 sections, 3 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Performance broken down by LLM and task. For each LLM and each task, we (post-hoc) chose the architecture that performed the best and recorded its score. See Table \ref{tab:comparison_of_LLMs} for the exact values.
  • Figure 2: Scores broken down by agents (i.e. architecture-LLM combinations). See Table \ref{tab:comparison_of_agents} for the exact values.
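
The two aggregations described in the figure captions can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the record layout, the function names, and every model name, architecture name, task name, and score below are hypothetical placeholders, not values from the paper.

```python
from collections import defaultdict

# Hypothetical records: (llm, architecture, task, score).
# All names and numbers are placeholders, not results from the paper.
RECORDS = [
    ("model-a", "react", "task-1", 0.40),
    ("model-a", "react+delegation", "task-1", 0.55),
    ("model-a", "react", "task-2", 0.30),
    ("model-a", "react+delegation", "task-2", 0.25),
    ("model-b", "react", "task-1", 0.20),
    ("model-b", "react+delegation", "task-1", 0.35),
]


def best_architecture_per_llm_task(records):
    """Figure 1 style: for each (llm, task), keep the best-scoring architecture."""
    best = {}
    for llm, arch, task, score in records:
        key = (llm, task)
        # Keep the first record seen, then replace it only on a strictly higher score.
        if key not in best or score > best[key][1]:
            best[key] = (arch, score)
    return best


def scores_per_agent(records):
    """Figure 2 style: group task scores by agent, i.e. (architecture, llm) pair."""
    grouped = defaultdict(dict)
    for llm, arch, task, score in records:
        grouped[(arch, llm)][task] = score
    return dict(grouped)


best = best_architecture_per_llm_task(RECORDS)
agents = scores_per_agent(RECORDS)
```

Because the first aggregation takes a maximum over architectures after the scores are known, it yields an optimistic per-LLM view; the per-agent grouping avoids that selection step.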