ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Junjie Ye; Zhengyin Du; Xuesong Yao; Weijian Lin; Yufei Xu; Zehui Chen; Zaiyuan Wang; Sining Zhu; Zhiheng Xi; Siyu Yuan; Tao Gui; Qi Zhang; Xuanjing Huang; Jiecao Chen

TL;DR

ToolHop is a query-driven dataset of 995 multi-hop queries and 3,912 locally executable tools for evaluating multi-hop tool use by LLMs. The dataset is built through a three-stage pipeline (tool creation, document refinement, and code generation) that yields diverse queries, interdependent tools, and verifiable answers. Evaluations of 14 LLMs from five families show that even the best model, GPT-4o, achieves only 49.04% answer accuracy in mandatory tool-use scenarios, which leaves substantial room for improvement and exposes distinct tool-use patterns across model families. The work also demonstrates the value of rich tool feedback and detailed error handling for correcting tool-use behavior, and it provides practical recommendations for building more capable tool-using systems. Code and data are publicly available on Hugging Face.

Abstract

Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals that tool-use strategies vary across model families, offering actionable insights to guide the development of more effective approaches. Code and data are available at https://huggingface.co/datasets/bytedance-research/ToolHop.

Paper Structure (35 sections, 10 figures, 11 tables)

Introduction
ToolHop
Task Formulation
Query-Driven Data Construction
Tool Creation
Document Refinement
Code Generation
Dataset Construction
Dataset Analysis
Diverse Queries
Meaningful Interdependencies
Locally Executable Tools
Detailed Feedback
Verifiable Answers
Experimental Setup
...and 20 more sections

Figures (10)

Figure 1: An illustration of multi-hop tool use. The process entails decomposing a complex multi-hop query into multiple atomic sub-queries, sequentially invoking the appropriate tools, retrieving results from the tool feedback, and iterating until the final answer is derived. This demonstrates the integration of comprehension, reasoning, and function-calling capabilities.
Figure 2: An illustration of our proposed query-driven data construction scheme, comprising three key processes: tool creation, document refinement, and code generation. This approach incrementally develops detailed tool document and code implementation for each atomic subquery within a multi-hop query.
Figure 3: Distribution of user queries across 47 domains in the ToolHop dataset.
Figure 4: Distribution of the number of tool parameters before and after document refinement.
Figure 5: Distribution of tool parameter types before and after document refinement.
...and 5 more figures
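
The multi-hop process described in the Figure 1 caption (decompose the query, call tools in sequence, read the feedback, iterate) can be sketched roughly as follows. This is a minimal illustration, not ToolHop's implementation: the tool names `book_author` and `birth_year`, the toy lookup tables, and the `run_multi_hop` helper are all hypothetical, and the plan of tool calls is fixed by hand here, whereas in the benchmark the LLM chooses each call itself.

```python
from typing import Any, Callable, Dict, List

# Hypothetical toy data standing in for locally executable tools.
_AUTHORS = {"Dune": "Frank Herbert"}
_BIRTH_YEARS = {"Frank Herbert": 1920}


def book_author(title: str) -> str:
    """Toy tool: answer the atomic sub-query 'who wrote <title>?'."""
    if title not in _AUTHORS:
        raise ValueError(f"unknown book title: {title!r}")
    return _AUTHORS[title]


def birth_year(person: str) -> int:
    """Toy tool: answer the atomic sub-query 'when was <person> born?'."""
    if person not in _BIRTH_YEARS:
        raise ValueError(f"unknown person: {person!r}")
    return _BIRTH_YEARS[person]


TOOLS: Dict[str, Callable[[Any], Any]] = {
    "book_author": book_author,
    "birth_year": birth_year,
}


def run_multi_hop(plan: List[str], first_argument: Any) -> Dict[str, Any]:
    """Call the tools named in `plan` in order, feeding each result to the next hop."""
    value = first_argument
    trace = []
    for name in plan:
        tool = TOOLS.get(name)
        if tool is None:
            return {"ok": False, "feedback": f"no tool named {name!r}", "trace": trace}
        try:
            value = tool(value)
        except ValueError as exc:
            # Surface the error text as feedback so the caller can correct the call.
            return {"ok": False, "feedback": f"{name} failed: {exc}", "trace": trace}
        trace.append((name, value))
    return {"ok": True, "answer": value, "trace": trace}


# "In what year was the author of Dune born?" split into two atomic hops.
result = run_multi_hop(["book_author", "birth_year"], "Dune")
print(result["answer"])
```

Returning the error message instead of raising mirrors the idea of detailed tool feedback: a failed hop reports what went wrong, and the caller can use that message to retry with a corrected argument.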
