Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren

TL;DR

This work introduces ToolRet, the first large-scale benchmark for tool retrieval in large language model (LLM) tool-use scenarios, pairing 7.6k tasks with 43k tools across Web APIs, code functions, and customized apps. It demonstrates that state-of-the-art information retrieval models underperform on tool retrieval, with low completeness and recall, due to low query-tool lexical overlap and domain shifts. To address this, the authors provide ToolRet-train, a 200k+ example training dataset with target-aware instructions, which yields substantial gains in retrieval quality and improves end-to-end task pass rates for tool-use LLMs. The work also discusses the need for richer evaluation axes and outlines future directions, including multimodal extensions and deeper instruction sensitivity analyses, to enable more reliable and scalable tool-augmented reasoning systems.

Abstract

Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
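To make the retrieval metrics mentioned above concrete, the following is a minimal sketch of how recall and completeness at a cutoff k might be computed for a single task. It is not taken from the paper: the function names, the tool identifiers, and the reading of completeness as "all target tools appear in the top k" are assumptions made for illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the target tools that appear among the top-k retrieved tools."""
    if not relevant:
        return 0.0
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def completeness_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 only if every target tool is in the top k (assumed definition), else 0.0."""
    if not relevant:
        return 0.0
    return 1.0 if relevant <= set(ranked[:k]) else 0.0


# Hypothetical ranking produced by a retriever for one task.
ranked_tools = ["weather_api", "calculator", "maps_api", "translator"]
# Hypothetical ground-truth tools the task needs.
target_tools = {"weather_api", "maps_api"}

r_at_2 = recall_at_k(ranked_tools, target_tools, k=2)
c_at_2 = completeness_at_k(ranked_tools, target_tools, k=2)
r_at_3 = recall_at_k(ranked_tools, target_tools, k=3)
c_at_3 = completeness_at_k(ranked_tools, target_tools, k=3)
```

Under this reading, a retriever can have non-zero recall while completeness stays at zero, which is one way a tool-use LLM can be left without a tool it needs even when retrieval looks partially successful.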

Paper Structure

This paper contains 45 sections, 1 equation, 7 figures, 15 tables, 1 algorithm.

Figures (7)

  • Figure 1: Correlation between the tool retrieval performance (e.g., Recall@10) of IR models and the end-to-end task pass rate of tool-use agents.
  • Figure 2: ROUGE-L overlap between the query (input) and the target tools (label).
  • Figure 3: Length distribution of our benchmark.
  • Figure 4: ROUGE-L overlap between the handcrafted seed instructions and model-generated instructions.
  • Figure 5: Correlation between the score on our benchmark and MTEB (retrieval subset).
  • ...and 2 more figures
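
Figures 2 and 4 report ROUGE-L overlap between pairs of texts. As a rough illustration of what such an overlap score measures, here is a minimal sketch based on the longest common subsequence (LCS) of tokens. The whitespace tokenization, lowercasing, the balanced F1 form, and the example strings are assumptions for illustration and may differ from the exact configuration used for the figures.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(query: str, tool_doc: str) -> float:
    """LCS-based F1 overlap between a query and a tool description."""
    q = query.lower().split()
    d = tool_doc.lower().split()
    if not q or not d:
        return 0.0
    lcs = lcs_length(q, d)
    if lcs == 0:
        return 0.0
    precision = lcs / len(d)
    recall = lcs / len(q)
    return 2 * precision * recall / (precision + recall)


# Hypothetical query and tool description.
score = rouge_l_f1("get the weather in paris", "get current weather")
```

A low score on such a measure means the query shares few in-order tokens with the tool text, which is the kind of low lexical overlap the TL;DR cites as a reason retrieval is hard in this setting.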