GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
Jie He, Jennifer Neville, Mengting Wan, Longqi Yang, Hui Liu, Xiaofeng Xu, Xia Song, Jeff Z. Pan, Pei Zhou
TL;DR
GenTool tackles tool generalization in language models with a synthetic-data-driven framework that targets two core generalization dimensions: zero-to-one and weak-to-strong. It builds a large, diverse training corpus from synthetic tools and carefully engineered query-tool pairs, and adopts a two-stage fine-tuning regime that first trains the model to rank candidate tools by capability and then to select the tool to invoke. Across multiple model architectures (1B–8B), GenTool achieves state-of-the-art tool-selection and invocation performance, significantly outperforming tuning-free baselines and GPT-4o, and generalizes robustly across four evaluation scenarios. Ablation and empirical analyses show that the ranking component is critical and highlight how data composition, tool similarity, and synthetic-data biases influence generalization. The work suggests a scalable path toward robust, real-world tool usage by LLMs, while acknowledging limitations related to model scale and its single-query, single-tool scope.
Abstract
Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about whether they develop robust tool-usage skills and can effectively generalize to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, which enables the model to address queries that initially lack a suitable tool by adopting and using one when it becomes available, and Weak-to-Strong Generalization, which allows the model to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Our analysis also provides valuable insights into the challenges LLMs encounter in tool generalization.
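To make the two generalization dimensions concrete, the following is a minimal illustrative sketch of how paired training examples might be simulated. It is not the authors' data pipeline: the names `Tool`, `Example`, `NO_TOOL`, `zero_to_one`, `weak_to_strong`, and `ranking_target`, as well as the integer `capability` score, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

NO_TOOL = "no_tool"  # hypothetical label for "no suitable tool is available"


@dataclass(frozen=True)
class Tool:
    name: str
    capability: int  # hypothetical score: higher means the tool covers the query more fully


@dataclass(frozen=True)
class Example:
    query: str
    tools: Tuple[Tool, ...]  # candidate tool set shown to the model
    target: str              # name of the tool to call, or NO_TOOL


def zero_to_one(query: str, distractors: List[Tool], tool: Tool) -> Tuple[Example, Example]:
    """Same query before and after a suitable tool becomes available."""
    before = Example(query, tuple(distractors), NO_TOOL)
    after = Example(query, tuple(distractors) + (tool,), tool.name)
    return before, after


def weak_to_strong(query: str, distractors: List[Tool], weak: Tool, strong: Tool) -> Tuple[Example, Example]:
    """Same query before and after an enhanced version of the tool is added."""
    before = Example(query, tuple(distractors) + (weak,), weak.name)
    after = Example(query, tuple(distractors) + (weak, strong), strong.name)
    return before, after


def ranking_target(tools: Tuple[Tool, ...]) -> List[str]:
    """Stage-one style target (assumed form): tool names ordered by capability."""
    return [t.name for t in sorted(tools, key=lambda t: t.capability, reverse=True)]


distractors = [Tool("calendar_lookup", 0)]
weak = Tool("weather_today", 1)
strong = Tool("weather_forecast", 2)

z_before, z_after = zero_to_one("Will it rain in Paris on Friday?", distractors, weak)
w_before, w_after = weak_to_strong("Will it rain in Paris on Friday?", distractors, weak, strong)
ranked = ranking_target(w_after.tools)
```

In this sketch, each pair holds the query fixed and changes only the available tool set, so the target shifts from `NO_TOOL` to a tool (zero-to-one) or from the weaker tool to the stronger one (weak-to-strong). The ranking list stands in for the first fine-tuning stage, and the `target` field for the second.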
