Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch

Yirong Zeng, Xiao Ding, Yutai Hou, Yuxian Wang, Li Du, Juyi Dai, Qiuyang Ding, Duyu Tang, Dandan Tu, Weiwen Liu, Bing Qin, Ting Liu

TL;DR

The paper tackles the generalization gap of tool-augmented LLMs trained on supervised data by adopting a pure RL approach that scales RL directly from Zero models (base models without post-training). It introduces GG-GRPO, a dynamic generalization-guided reward design that shifts from broad exploration to strict, AST-based tool integration, and uses it to train Tool-Zero models (7B/32B) that autonomously utilize general tools. Across BFCL and related benchmarks, Tool-Zero consistently outperforms SFT and RL-with-SFT baselines, demonstrating robust cross-dataset and intra-dataset generalization and supporting the claim that RL can elicit intrinsic reasoning for open-domain tool use. The approach reduces reliance on task-specific data and offers a scalable path toward versatile tool-augmented AI agents, with potential impact on real-world automated reasoning and tool integration tasks.

Abstract

Training tool-augmented LLMs has emerged as a promising approach to enhancing language models' capabilities for complex tasks. The current supervised fine-tuning (SFT) paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, the reinforcement learning (RL) paradigm has been shown to endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: can pure RL be used to effectively elicit a model's intrinsic reasoning capabilities and enhance tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series of models. These models are trained to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.
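To make the idea of a reward that "progressively shifts" from exploratory to exploitative concrete, the following is a minimal sketch and not the paper's GG-GRPO implementation. It assumes a tool call is a dictionary with `name` and `arguments` keys, that the exploratory reward gives partial credit, that the exploitative reward is an exact match, and that the two are blended linearly over training steps. The function names, the 0.5/0.5 weighting, and the linear schedule are all hypothetical choices made for illustration.

```python
def fine_grained_reward(pred: dict, gold: dict) -> float:
    """Hypothetical exploratory reward: partial credit for a near-miss call.

    Half of the credit is for the tool name, half for the fraction of
    gold arguments reproduced exactly.
    """
    name_score = 1.0 if pred.get("name") == gold.get("name") else 0.0
    gold_args = gold.get("arguments", {})
    pred_args = pred.get("arguments", {})
    if gold_args:
        matched = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
        arg_score = matched / len(gold_args)
    else:
        # No arguments expected: full credit only if none were produced.
        arg_score = 1.0 if not pred_args else 0.0
    return 0.5 * name_score + 0.5 * arg_score


def strict_reward(pred: dict, gold: dict) -> float:
    """Hypothetical exploitative reward: 1 only for an exact match."""
    return 1.0 if pred == gold else 0.0


def progressive_reward(pred: dict, gold: dict, step: int, total_steps: int) -> float:
    """Blend the two rewards; the linear schedule is an assumption."""
    alpha = min(max(step / total_steps, 0.0), 1.0)  # 0 = explore, 1 = exploit
    return (1.0 - alpha) * fine_grained_reward(pred, gold) + alpha * strict_reward(pred, gold)


gold_call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
near_miss = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "fahrenheit"}}

early = progressive_reward(near_miss, gold_call, step=0, total_steps=100)
middle = progressive_reward(near_miss, gold_call, step=50, total_steps=100)
late = progressive_reward(near_miss, gold_call, step=100, total_steps=100)
```

Under these assumptions, a near-miss call earns partial credit early in training and none once the schedule reaches the strict reward, while an exactly correct call scores 1.0 at every step.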

Paper Structure

This paper contains 23 sections, 10 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: An example response from a tool-augmented model trained under the SFT paradigm. The model fails to recognize similar but unfamiliar task contexts (e.g., code transpilation), highlighting limited generalization to unseen tool-use scenarios.
  • Figure 2: Intra-dataset performance. The improvement on the Live metric, whose data follows the training distribution, is significantly greater than on the other metrics. SFT struggles with out-of-distribution generalization in open-domain settings.
  • Figure 3: The overall architecture of GG-GRPO, which introduces a dynamic generalization-guided reward design for rule-based RL. It progressively shifts the reward mechanism from a fine-grained generic reward to a strict answer-correctness reward.
  • Figure 4: Ablation study results for GG-GRPO, measured by overall performance on the BFCL benchmark.
  • Figure 5: Hyperparameter analysis of the progressive reward strategy, measured by overall performance on the BFCL benchmark.
  • ...and 1 more figures