Benchmark Test-Time Scaling of General LLM Agents

Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong

TL;DR

This work introduces General AgentBench, a benchmark that provides a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains, and finds that neither sequential nor parallel test-time scaling yields effective performance improvements in practice.

Abstract

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
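
The verification gap described above can be illustrated with a toy calculation. The sketch below is not taken from the paper's codebase: the function names, the per-trajectory correctness flags, and the selector scores are all hypothetical, and the paper's exact metric definitions may differ. It only contrasts how often a correct trajectory exists among K samples with how often a score-based selector actually picks one.

```python
from typing import List, Sequence, Tuple

# Hypothetical data layout (an assumption, not the benchmark's format):
# each task is a pair (correct flags, selector scores), one entry per sampled trajectory.
Task = Tuple[Sequence[bool], Sequence[float]]


def any_correct(correct: Sequence[bool]) -> bool:
    """True if at least one of the K sampled trajectories solves the task."""
    return any(correct)


def selected_correct(correct: Sequence[bool], scores: Sequence[float]) -> bool:
    """True if the trajectory with the highest selector score solves the task."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return correct[best]


def verification_gap(tasks: List[Task]) -> Tuple[float, float, float]:
    """Return (rate with a correct sample, rate after selection, difference)."""
    n = len(tasks)
    pass_rate = sum(any_correct(c) for c, _ in tasks) / n
    select_rate = sum(selected_correct(c, s) for c, s in tasks) / n
    return pass_rate, select_rate, pass_rate - select_rate


# Made-up example: four tasks, two or three sampled trajectories each.
toy_tasks: List[Task] = [
    ([False, True, False], [0.9, 0.2, 0.1]),  # correct sample exists, selector misses it
    ([True, False], [0.8, 0.3]),              # selector picks the correct sample
    ([False, False], [0.5, 0.4]),             # no correct sample at all
    ([False, True], [0.1, 0.7]),              # selector picks the correct sample
]
pass_rate, select_rate, gap = verification_gap(toy_tasks)
```

In this made-up data a correct trajectory exists for three of four tasks, but the selector recovers only two of them; the difference between the two rates is the kind of gap the abstract attributes to parallel scaling.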

Paper Structure (59 sections, 1 equation, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Evaluating general LLM agents under a realistic user-interaction scenario. A: GPT-5's performance drop under General AgentBench compared to static, domain-specific evaluation. B: Sequential test-time scaling via longer interaction histories can lead to unstable or degraded performance. C: While correct solutions increasingly appear in the generation space (pass@$K$), agents often fail to select them, revealing a verification gap.
  • Figure 2: Illustration of how General AgentBench covers a wide range of task categories while providing a unified interface to simulate real-world user interactions. The green region indicates the specific task currently being handled by the agent (e.g., a search task). Orange boxes denote other clients and servers that remain active and responsive but are not directly involved in the current interaction. Red indicates that other domain-specific data are excluded.
  • Figure 3: Relative performance change across domains from the Baseline ($B$) specialized agent setting to the general agent ($G$) setting with unified context and tools. Negative values indicate performance degradation under the General AgentBench.
  • Figure 4: Performance comparison between specialized-agent and general-agent settings. Top: Absolute performance. Bottom: Relative performance degradation under the general-agent setting.
  • Figure 5: Test-time scaling behaviors of general LLM agents. Results are reported for five models across four domains on General AgentBench. Top: Parallel scaling expands the solution space through increased sampling. Bottom: Sequential scaling allocates additional computation via longer interaction histories, yet exhibits unstable or diminishing returns.
  • ...and 6 more figures