ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiecao Chen

TL;DR

ToolHop tackles the evaluation of multi-hop tool use in LLMs by introducing a query-driven dataset of 995 multi-hop queries and 3,912 locally executable tools. The dataset is built through a three-stage pipeline (tool creation, document refinement, and code generation) to ensure diverse, interdependent tools with verifiable answers. Evaluations on 14 LLMs from five families reveal that even the best model (GPT-4o) achieves only 49.04% answer accuracy in mandatory tool-use scenarios, highlighting substantial room for improvement and revealing distinct tool-use patterns across model families. The work also demonstrates the value of rich tool feedback and detailed error handling for correcting tool-use behavior, and it provides practical recommendations for building more capable tool-using systems. Code and data are publicly available on Hugging Face.

Abstract

Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in https://huggingface.co/datasets/bytedance-research/ToolHop.

Paper Structure

This paper contains 35 sections, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: An illustration of multi-hop tool use. The process entails decomposing a complex multi-hop query into multiple atomic sub-queries, sequentially invoking the appropriate tools, retrieving results from the tool feedback, and iterating until the final answer is derived. This demonstrates the integration of comprehension, reasoning, and function-calling capabilities.
  • Figure 2: An illustration of our proposed query-driven data construction scheme, comprising three key processes: tool creation, document refinement, and code generation. This approach incrementally develops a detailed tool document and code implementation for each atomic subquery within a multi-hop query.
  • Figure 3: Distribution of user queries across 47 domains in the ToolHop dataset.
  • Figure 4: Distribution of the number of tool parameters before and after document refinement.
  • Figure 5: Distribution of tool parameter types before and after document refinement.
  • ...and 5 more figures
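
The loop described in the Figure 1 caption (decompose the query, invoke tools in sequence, read the tool feedback, iterate) can be sketched in a few lines of Python. This is only an illustrative sketch, not the ToolHop harness: the tools `author_of` and `birth_year_of`, their lookup data, and the helper `run_multi_hop` are all hypothetical, and the decomposition into sub-queries is hard-coded as a fixed plan rather than produced by an LLM.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical locally executable tools (illustrative only, not from ToolHop).
def author_of(book: str) -> str:
    authors = {"Dune": "Frank Herbert"}
    if book not in authors:
        raise ValueError(f"unknown book: {book!r}")
    return authors[book]

def birth_year_of(person: str) -> int:
    years = {"Frank Herbert": 1920}
    if person not in years:
        raise ValueError(f"unknown person: {person!r}")
    return years[person]

TOOLS: Dict[str, Callable[[Any], Any]] = {
    "author_of": author_of,
    "birth_year_of": birth_year_of,
}

def run_multi_hop(
    plan: List[str], first_argument: Any
) -> Tuple[Optional[Any], List[Tuple[str, Any]]]:
    """Call each tool in order, feeding one hop's result into the next.

    Returns the final answer (or None on failure) plus a trace of
    (tool name, feedback) pairs, where feedback is a result or an error message.
    """
    value = first_argument
    trace: List[Tuple[str, Any]] = []
    for tool_name in plan:
        try:
            value = TOOLS[tool_name](value)
        except Exception as exc:
            # Surface the error as feedback instead of crashing the whole run.
            trace.append((tool_name, f"error: {exc}"))
            return None, trace
        trace.append((tool_name, value))
    return value, trace

# Multi-hop query: "In which year was the author of Dune born?"
answer, trace = run_multi_hop(["author_of", "birth_year_of"], "Dune")
print(answer)
print(trace)
```

In an actual evaluation the model would choose each tool call itself and could use an error message in the trace to retry with corrected arguments; the fixed plan here only makes the dependency between hops visible.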