ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiecao Chen
TL;DR
ToolHop addresses the challenge of evaluating multi-hop tool use by LLMs by introducing a query-driven dataset of 995 multi-hop queries and 3,912 locally executable tools. The dataset is built through a three-stage pipeline (Tool Creation, Document Refinement, and Code Generation) to ensure diverse, interdependent tools with verifiable answers. Evaluations on 14 LLMs from five families reveal that even the best model (GPT-4o) achieves only 49.04% answer accuracy in mandatory tool-use scenarios, highlighting substantial room for improvement and revealing distinct tool-use patterns across model families. The work also demonstrates the value of rich tool feedback and detailed error handling for correcting tool-use behavior, and it provides practical recommendations for building more capable tool-using systems. Code and data are publicly available at HuggingFace.
Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals that tool-use strategies vary across model families, offering actionable insights to guide the development of more effective approaches. Code and data are available at https://huggingface.co/datasets/bytedance-research/ToolHop.
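To make the idea of interdependent, locally executable tools with verifiable answers concrete, the following is a minimal illustrative sketch, not the ToolHop implementation. The two toy tools, their lookup data, and the helper names `run_chain` and `answer_accuracy` are all hypothetical and chosen only for illustration; the exact-match scoring is likewise an assumption rather than the paper's stated metric definition.

```python
from typing import Callable, Dict, List, Union

# Hypothetical toy tools: each is a plain local function, so a call can be
# executed and checked without any external service.
def get_author(book: str) -> str:
    authors = {"Example Book": "Example Author"}
    if book not in authors:
        # Descriptive errors act as feedback a model could use to retry.
        raise ValueError(f"unknown book: {book!r}")
    return authors[book]

def get_birth_year(person: str) -> int:
    years = {"Example Author": 1950}
    if person not in years:
        raise ValueError(f"unknown person: {person!r}")
    return years[person]

TOOLS: Dict[str, Callable] = {
    "get_author": get_author,
    "get_birth_year": get_birth_year,
}

def run_chain(steps: List[str], first_input: str) -> Union[str, int]:
    """Run tools in order; each hop consumes the previous hop's output."""
    value: Union[str, int] = first_input
    for name in steps:
        try:
            value = TOOLS[name](value)
        except (KeyError, ValueError) as exc:
            # Return the error text instead of raising, mimicking tool feedback.
            return f"error in {name}: {exc}"
    return value

def answer_accuracy(predictions: list, golds: list) -> float:
    """Percentage of predictions that exactly match the gold answers."""
    correct = sum(str(p) == str(g) for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)

if __name__ == "__main__":
    # Two-hop query: "In what year was the author of 'Example Book' born?"
    print(run_chain(["get_author", "get_birth_year"], "Example Book"))
```

Here the second tool can only be called correctly once the first has returned, which is the kind of dependency a multi-hop query is meant to exercise, and the final value can be compared directly against a stored gold answer.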
