AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan; Xuyan Ye; Yupeng Huo; Zhi-Yuan Chen; Yiju Guo; Shenzhi Yang; Wenkai Yang; Shuqi Ye; Jingwen Chen; Haotian Chen; Xin Cong; Yankai Lin

Abstract

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.

Paper Structure (28 sections, 2 equations, 9 figures, 6 tables)

Introduction
Related Work
LLM Agents
Reward Benchmarks
Benchmark Construction
Evaluation Protocol
Data Collection
Task Curation
Trajectory Generation
Expert Annotation
Statistics
Evaluation
Setup
Evaluated LLMs
Metrics
...and 13 more sections

Figures (9)

Figure 1: Comparison of step accuracy across 20 LLMs on AgentProcessBench (%).
Figure 2: Example of an agent trajectory with human annotated step labels. Each instance in AgentProcessBench consists of a complete tool-using agent trajectory, containing interleaved user messages, assistant responses, and tool calls. During evaluation, the LLM is tasked with annotating each of the assistant’s steps with a label of correct (+1), neutral (0), or incorrect (-1).
Figure 3: An overview of AgentProcessBench. First, we sample trajectories from four representative agent benchmarks generated by five source models. Subsequently, human experts annotate the data via a specialized platform, achieving an inter-annotator agreement of 89.1%. Finally, we utilize the constructed benchmark to evaluate 20 distinct models across various families and parameter scales using the StepAcc and FirstErrAcc metrics.
Figure 4: Distribution of trajectory-level and step-level labels across models, where both Qwen-series models use the 2507 Instruct version.
Figure 5: Distribution of first error positions (indexed from 0).
...and 4 more figures
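
The captions above name a ternary step label (+1, 0, -1), first error positions indexed from 0, and two metrics, StepAcc and FirstErrAcc, but this page does not give their formulas. The following is a minimal sketch under stated assumptions, not the paper's definition: StepAcc is taken to be the fraction of steps, pooled over all trajectories, where a predicted label equals the human label, and FirstErrAcc the fraction of trajectories where the predicted first incorrect step matches the human one (a trajectory with no incorrect step on either side counts as a match). The function names are hypothetical.

```python
from typing import List, Optional, Sequence

# Ternary step labels as described in the Figure 2 caption.
CORRECT, NEUTRAL, INCORRECT = 1, 0, -1


def first_error_index(labels: Sequence[int]) -> Optional[int]:
    """Return the 0-based index of the first incorrect (-1) step, or None."""
    for i, label in enumerate(labels):
        if label == INCORRECT:
            return i
    return None


def step_accuracy(gold: List[Sequence[int]], pred: List[Sequence[int]]) -> float:
    """Assumed StepAcc: share of steps whose predicted label equals the human label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of trajectories")
    matched = 0
    total = 0
    for g_traj, p_traj in zip(gold, pred):
        if len(g_traj) != len(p_traj):
            raise ValueError("each trajectory needs one predicted label per step")
        total += len(g_traj)
        matched += sum(g == p for g, p in zip(g_traj, p_traj))
    return matched / total if total else 0.0


def first_error_accuracy(gold: List[Sequence[int]], pred: List[Sequence[int]]) -> float:
    """Assumed FirstErrAcc: share of trajectories with the same first-error index."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of trajectories")
    if not gold:
        return 0.0
    hits = sum(
        first_error_index(g) == first_error_index(p) for g, p in zip(gold, pred)
    )
    return hits / len(gold)


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    gold = [[1, 0, -1, -1], [1, 1, 1]]
    pred = [[1, -1, -1, -1], [1, 0, 1]]
    print(f"StepAcc     = {step_accuracy(gold, pred):.3f}")
    print(f"FirstErrAcc = {first_error_accuracy(gold, pred):.3f}")
```

In the toy data, the first trajectory's predicted first error lands one step too early, so it agrees on three of four step labels yet fails the first-error check. This is one way the two metrics can diverge. The paper may average differently (for example per trajectory rather than over pooled steps), so the numbers this sketch produces should not be compared against reported results.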
