ToolGen: Unified Tool Retrieval and Calling via Generation

Renxi Wang; Xudong Han; Lei Ji; Shu Wang; Timothy Baldwin; Haonan Li

TL;DR

ToolGen addresses the scalability challenge of LLM tool use by embedding real-world tools as unique tokens in the model's vocabulary, turning tool retrieval and invocation into a single generative process. It introduces a three-stage training pipeline (tool memorization, retrieval training, and end-to-end agent-tuning) and uses atomic indexing to map each tool to a single token. On ToolBench's 47k-tool dataset, ToolGen achieves competitive tool retrieval and superior end-to-end task completion with lower latency and cost, demonstrating robustness in multi-domain settings. The approach also uses constrained decoding to mitigate hallucinations and outlines future directions for integrating chain-of-thought and reinforcement learning to further enhance autonomous tool use.

Abstract

As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowledge directly into the LLM's parameters by representing each tool as a unique token. This enables the LLM to generate tool calls and arguments as part of its next-token prediction, seamlessly blending tool invocation with language generation. Our framework allows the LLM to access and use a vast number of tools with no additional retrieval step, significantly enhancing both performance and scalability. Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains. By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more versatile, efficient, and autonomous AI systems. ToolGen enables end-to-end tool learning and opens opportunities for integration with other advanced techniques such as chain-of-thought and reinforcement learning, thereby expanding the practical capabilities of LLMs.
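
To make "representing each tool as a unique token" concrete, here is a minimal sketch of atomic indexing over a toy vocabulary. It is an illustration under stated assumptions, not the authors' implementation: the tool names, the `<<...>>` token format, and the helper `build_tool_vocab` are all hypothetical, and a real system would also need to extend the model's embedding matrix to cover the new ids.

```python
def build_tool_vocab(base_vocab, tool_names):
    """Return a copy of base_vocab extended with one new token per tool.

    Assumes base_vocab ids are contiguous (0..n-1), so len(vocab) is
    always the next free id.
    """
    vocab = dict(base_vocab)
    tool_token_ids = {}
    for name in tool_names:
        token = f"<<{name}>>"  # hypothetical token format
        if token in vocab:
            raise ValueError(f"duplicate tool token: {token}")
        vocab[token] = len(vocab)            # one new id per tool
        tool_token_ids[name] = vocab[token]  # tool -> single token id
    return vocab, tool_token_ids


# Toy base vocabulary and hypothetical tools (not taken from ToolBench).
base_vocab = {"<pad>": 0, "find": 1, "weather": 2, "in": 3, "Paris": 4}
tools = ["weather_lookup", "currency_convert", "flight_search"]

vocab, tool_ids = build_tool_vocab(base_vocab, tools)
print(tool_ids)
```

Because every tool maps to exactly one id, selecting a tool is a single next-token prediction rather than a multi-token string the model has to spell out correctly.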

Paper Structure

This paper contains 47 sections, 2 equations, 10 figures, 12 tables, and 1 algorithm.

Introduction
Related Work
Tool Retrieval
LLM-Agents with Tool Calling
ToolGen
Preliminaries
Tool Virtualization
Tool Memorization
Retrieval Training
End-to-End Agent-Tuning
Inference
Tool Retrieval Evaluation
Experimental Setup
Dataset
Baselines
...and 32 more sections

Figures (10)

Figure 1: Comparison between previous retrieval-based methods and our ToolGen. Previous methods use a retriever to find relevant tools by similarity matching, then place those tools in the prompt for the LLM to select from. ToolGen retrieves tools by generating tool tokens directly, and can also complete the task without relying on any external retriever.
Figure 2: An illustration of the ToolGen framework. In tool virtualization, tools are mapped into virtual tokens. In the following three-stage training, ToolGen first memorizes tools by predicting tool tokens based on their documentation. Then it learns to retrieve tools by predicting tool tokens from queries. Finally, pipeline data, i.e., trajectories, are used to finetune the retriever model from the previous stage, resulting in the ToolGen Agent model.
Figure 3: The distribution of the number of subtokens per tool varies across different indexing methods.
Figure 4: Hallucination rates for generating nonexistent tools across different models. ToolGen does not generate any nonexistent tools when using constrained decoding. Without this constraint, however, ToolGen generates 7% non-tool tokens during the Action generation stage with atomic indexing, and even more with semantic indexing. ToolLlama and GPT-3.5 still hallucinate despite being provided with five ground-truth tools in the prompt. Without any tools specified in the prompt, ToolLlama generates over 50% nonexistent tool names.
Figure 5: Real examples from ToolkenGPT, Toolformer, and ToolGen (ours). Both ToolkenGPT and Toolformer describe the available tools in the prompt, while ToolGen does not require tools to be mentioned in its prompt.
...and 5 more figures
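
The constrained decoding mentioned in the Figure 4 caption can be sketched as a mask over next-token scores. This is a toy illustration, assuming atomic indexing (one token per tool) and greedy selection. The function `constrained_argmax`, the scores, and the token ids are hypothetical and do not come from the paper.

```python
import math


def constrained_argmax(logits, allowed_ids):
    """Return the index of the highest-scoring token among allowed_ids only."""
    if not allowed_ids:
        raise ValueError("allowed_ids must not be empty")
    # Disallowed positions get -inf so they can never be selected.
    masked = [
        score if i in allowed_ids else -math.inf
        for i, score in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)


# Toy scores over a 5-token vocabulary; ids 3 and 4 stand in for tool tokens.
logits = [2.5, 0.1, 3.0, 1.2, 0.7]
tool_token_ids = {3, 4}

# Without the mask, the best-scoring token (id 2) is not a tool.
unconstrained = max(range(len(logits)), key=logits.__getitem__)
# With the mask, the choice is forced onto a valid tool token.
constrained = constrained_argmax(logits, tool_token_ids)
print(unconstrained, constrained)
```

With one token per tool, a single-step mask like this is enough to rule out nonexistent tools at the Action step. An indexing scheme that spreads a tool name over several subtokens would instead need the mask to track valid prefixes.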
