ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

Fu Chen, Peng Wang, Xiyin Li, Wen Li, Shichi Lei, Dongdong Xiang

TL;DR

ToolExpander tackles training instability and data inefficiency in GRPO when applying tool-using reinforcement learning to small LLMs. It introduces Dynamic Multi-Round Hard Sampling to replace hard samples with high-quality few-shot data and Self-Exemplifying Thinking to enable autonomous generation and analysis of few-shot contexts, while discarding KL divergence in the core objective. The approach yields improved stability and tool-using accuracy on benchmarks like BFCL and APIBank, with notable gains for 1.5B models and measurable gains from self-generated reasoning. These techniques enhance data utilization and learning efficiency, enabling stronger performance in resource-constrained LLMs without relying on heavy reward engineering or large-scale prompts.

Abstract

Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations: (1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples (those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations; (2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01). Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
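The mechanics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `few_shot_pool` lookup, the decay rate, and the use of a population standard deviation are all assumptions made for illustration. Only the 10-rollout hard-sample criterion and the 0.01 bonus come from the abstract. The sketch also shows why hard samples matter under GRPO: when every rollout in a group receives the same reward, the group-relative advantages are all zero, so the sample contributes no learning signal.

```python
import math
from statistics import mean, pstdev


def is_hard_sample(rollout_correct):
    """A sample is 'hard' when none of its rollouts is correct.

    The abstract uses 10 rollouts; here the list length sets the count.
    """
    return not any(rollout_correct)


def substitute_hard_samples(batch, rollout_results, few_shot_pool):
    """Swap each hard prompt for a few-shot variant, if one is available.

    few_shot_pool is a hypothetical dict mapping a prompt to the same
    prompt prefixed with high-quality demonstrations.
    """
    out = []
    for prompt, results in zip(batch, rollout_results):
        if is_hard_sample(results) and prompt in few_shot_pool:
            out.append(few_shot_pool[prompt])
        else:
            out.append(prompt)
    return out


def exponential_lr(base_lr, decay_rate, step):
    """Exponential learning-rate decay (decay_rate is an assumed knob)."""
    return base_lr * math.exp(-decay_rate * step)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards normalized within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def total_reward(task_reward, produced_self_example, bonus=0.01):
    """Add the small bonus when the model wrote and analyzed its own example."""
    return task_reward + (bonus if produced_self_example else 0.0)


# Hypothetical data: q1 fails all 10 rollouts, q2 succeeds once.
batch = ["q1", "q2"]
rollouts = [[False] * 10, [False] * 9 + [True]]
pool = {"q1": "few-shot demo + q1"}
new_batch = substitute_hard_samples(batch, rollouts, pool)

# A group with identical rewards yields zero advantage for every rollout.
flat_advantages = group_relative_advantages([0.0] * 10)
```

In this sketch, `new_batch` keeps `q2` unchanged and replaces `q1` with its few-shot variant, which is the substitution step the abstract describes. How the few-shot demonstrations are chosen, and how often the hard-sample check is repeated across training rounds, is not specified in the abstract and is left out here.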

Paper Structure

This paper contains 19 sections, 6 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: The Overall Framework of ToolExpander
  • Figure 2: Hard Sample Count: few-shot substitution vs. increasing the number of rollouts (10 rollouts in the few-shot scenarios, 32 rollouts in the non-few-shot scenarios)
  • Figure 3: Fluctuation of Hard Sample Count During Training
  • Figure 4: Fully Correct Reward Results of few-shot Generated by the Model During Training
  • Figure 5: Accuracy Comparisons Across Different Model Configurations and Training Strategies. (a) Based on data from the BFCL list as of 2025-08-26. (b) Based on data from ToolRL: Reward is all you need.
  • ...and 4 more figures