TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

Pengfei He, Zhenwei Dai, Bing He, Hui Liu, Xianfeng Tang, Hanqing Lu, Juanhui Li, Jiayuan Ding, Subhabrata Mukherjee, Suhang Wang, Yue Xing, Jiliang Tang, Benoit Dumoulin

TL;DR

TRAJECT-Bench addresses the need for trajectory-aware evaluation of LLM-based agentic tool use by introducing a large-scale benchmark with diverse, executable tools and task-driven trajectories. It models both parallel and sequential tool structures and introduces trajectory-level metrics (EM, Inclusion, Usage, Traj-Satisfy) in addition to final accuracy, enabling precise diagnostics of tool selection, parameterization, and ordering. Empirical results reveal scaling and failure modes such as tool confusion and parameter-blind selection, and show that agentic approaches like ReAct can improve tool use, especially with retrieval mechanisms. The released dataset and code offer a foundation for systematic improvements in planning, selecting, and executing tools in complex, real-world scenarios.

Abstract

Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing work evaluates LLMs' tool-use capability, it largely focuses on final answers and overlooks the detailed tool-usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark that comprehensively evaluates LLMs' tool-use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench reports trajectory-level diagnostics, including tool selection, argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as confusion between similar tools and parameter-blind selection, as well as scaling behavior with tool diversity and trajectory length, including a bottleneck in the transition from short to mid-length trajectories, offering actionable guidance for LLMs' tool use.
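
The abstract names the trajectory-level diagnostics (tool selection, argument correctness, dependency/order satisfaction) without defining them. The sketch below shows one way such checks could be computed over a predicted and a reference trajectory. All definitions, function names, and the example tools (`search_flights`, `get_weather`, `get_news`) are assumptions made for illustration and may differ from the benchmark's actual EM, Inclusion, and Usage metrics. Traj-Satisfy is left out because this summary does not say how it is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple = ()  # sorted (key, value) pairs, so calls compare by value


def call(name, **args):
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(name, tuple(sorted(args.items())))


def exact_match(pred, gold):
    """Assumed EM: 1.0 if predicted tool names equal the reference names (as a multiset)."""
    return float(sorted(c.name for c in pred) == sorted(c.name for c in gold))


def inclusion(pred, gold):
    """Assumed Inclusion: fraction of reference tool names present in the prediction."""
    pred_names = {c.name for c in pred}
    return sum(g.name in pred_names for g in gold) / len(gold)


def usage(pred, gold):
    """Assumed Usage: among reference tools that were selected, the fraction
    called with exactly the reference arguments.
    Simplification: if a tool is predicted twice, only the last call is checked."""
    pred_by_name = {c.name: c for c in pred}
    selected = [g for g in gold if g.name in pred_by_name]
    if not selected:
        return 0.0
    return sum(pred_by_name[g.name].args == g.args for g in selected) / len(selected)


def order_satisfied(pred, gold):
    """True if the reference tool names occur in the prediction in reference order
    (a subsequence check; extra predicted calls are allowed)."""
    remaining = iter(c.name for c in pred)
    return all(name in remaining for name in (g.name for g in gold))


# Hypothetical two-step reference trajectory and an imperfect prediction.
gold = [
    call("search_flights", origin="SFO", dest="JFK"),
    call("get_weather", city="New York"),
]
pred = [
    call("search_flights", origin="SFO", dest="LAX"),  # right tool, wrong argument
    call("get_weather", city="New York"),
    call("get_news"),                                   # extra, unneeded tool
]

scores = {
    "em": exact_match(pred, gold),
    "inclusion": inclusion(pred, gold),
    "usage": usage(pred, gold),
    "order_ok": order_satisfied(pred, gold),
}
print(scores)
```

In this example the prediction includes every reference tool but also an extra one and a wrong argument, so a final-answer score alone would not show which of selection, parameterization, or ordering went wrong, whereas the separate checks do.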

Paper Structure

This paper contains 19 sections, 2 figures, 11 tables.

Figures (2)

  • Figure 1: An illustration of the data in TRAJECT-Bench. The left side shows parallel queries: tool trajectories are created from real task types, and queries at two difficulty levels are then generated. The right side shows the generation process for sequential queries: a tool graph is built first, task sequences are then manually designed, and finally detailed queries and trajectories are created.
  • Figure 2: Scaling of models' tool-use behavior. The x-axis denotes the number of tools in the trajectory and the y-axis denotes the EM metric. The left panel is for simple queries and the right panel is for hard queries.