GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation

Jie He, Jennifer Neville, Mengting Wan, Longqi Yang, Hui Liu, Xiaofeng Xu, Xia Song, Jeff Z. Pan, Pei Zhou

TL;DR

GenTool tackles the challenge of tool generalization in language models by proposing a synthetic-data-driven framework that targets two core generalization dimensions: zero-to-one and weak-to-strong. It creates a large, diverse training corpus with synthetic tools and carefully engineered query-tool pairs, and adopts a two-stage fine-tuning regime that first ranks candidate tools by capability and then selects the optimal invocation. Across multiple model architectures (1B–8B), GenTool achieves state-of-the-art tool-selection and invocation performance, significantly outperforming tuning-free baselines and GPT-4o, and shows robust generalization across four evaluation scenarios. Ablation and empirical analyses reveal the ranking component’s critical role and highlight how data composition, tool similarity, and synthetic-data biases influence generalization. The work suggests a scalable path toward robust, real-world tool usage by LLMs, while acknowledging limitations related to model scale and its single-query, single-tool scope.

Abstract

Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and to generalize effectively to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis provides valuable insights into the challenges LLMs encounter in tool generalization.

Paper Structure

This paper contains 61 sections, 9 equations, 7 figures, 23 tables.

Figures (7)

  • Figure 1: An example illustrating tool generalization challenges in selecting the most suitable tool for a user query. Although the model was trained on tools like Yelp and Web-Search and encountered the same query during training ("Could you ... in New York City?"), at test time it struggles to select the more appropriate Yelp tool over Web-Search.
  • Figure 2: Synthetic data that simulates the generalization process is constructed in three steps. First, existing datasets are used to create new tools. Next, diverse instructions guide the generation of matching queries for these new tools. Finally, corresponding invocation details are created for various tool-query combinations.
  • Figure 3: Overview of the GenTool Framework for Tool Learning and Generalization. Initially, the model handles a query by defaulting to generate_response when no suitable tool is available. Next, when a relevant tool, web_search, is added to the toolset, the model selects web_search, demonstrating zero-to-one generalization training. Later, upon adding map_search, the model demonstrates weak-to-strong generalization by correctly ranking and selecting it over web_search and other alternatives.
  • Figure 4: Detailed results for different test set scenarios. GenTool consistently outperforms all baselines.
  • Figure 5: The relationship between the number of training examples related to the test set's gold tool and the test set's accuracy. The blue line represents seen tools, while the purple line denotes unseen tools.
  • ...and 2 more figures
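
The toolset progression described in the Figure 3 caption can be sketched as a toy simulation. This is only an illustration, not the authors' implementation: GenTool learns ranking and selection through fine-tuning, whereas the sketch below substitutes a hand-written score table. The tool names (generate_response, web_search, map_search) come from the caption; the score values and the helper names rank_tools and select_tool are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical capability scores for one location-style query.
# These values are assumptions for illustration, not taken from the paper;
# a higher score means the tool is better suited to the query.
ASSUMED_SCORES: Dict[str, int] = {
    "generate_response": 0,  # fallback when no suitable tool exists
    "web_search": 1,         # weaker, general-purpose tool
    "map_search": 2,         # stronger, more specialised tool
}


def rank_tools(toolset: List[str]) -> List[str]:
    """Stage 1 (sketch): order available tools from most to least capable."""
    return sorted(toolset, key=lambda name: ASSUMED_SCORES.get(name, 0), reverse=True)


def select_tool(toolset: List[str]) -> str:
    """Stage 2 (sketch): choose the top-ranked tool from the current toolset."""
    return rank_tools(toolset)[0]


# The toolset grows over time, mirroring the three situations in the caption.
stages = [
    ["generate_response"],                              # no suitable tool yet
    ["generate_response", "web_search"],                # zero-to-one: a tool appears
    ["generate_response", "web_search", "map_search"],  # weak-to-strong: a better tool appears
]

selections = [select_tool(toolset) for toolset in stages]
print(selections)
```

In this sketch the selection changes only because the toolset changes; the query and the scoring stay fixed, which is the behaviour the two generalization dimensions are meant to capture.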