iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use

Yirong Zeng; Xiao Ding; Yuxian Wang; Weiwen Liu; Wu Ning; Yutai Hou; Xu Huang; Duyu Tang; Dandan Tu; Bing Qin; Ting Liu

TL;DR

This paper tackles the problem that synthetic data for tool-use in LLMs yields diminishing returns due to fragment-level parameter errors. It introduces iTool, an iterative reinforced fine-tuning framework that combines MCTS-based path exploration for diverse responses with fine-grained preference data and direct preference optimization to correct fragment errors. Warm-up curriculum learning paired with an iterative reinforcement loop enables the model to improve tool-use performance, especially in complex scenarios, achieving notable gains over a same-size base model and outperforming larger models in benchmarks like BFCL and API-Bank. The approach offers a data-efficient, scalable path to enhance advanced tool use in LLMs and highlights the importance of targeted error correction and data diversity in synthetic data training.

Abstract

Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains decay significantly as the amount of synthetic data increases. The model struggles to benefit from additional synthetic data, which fails to endow it with advanced tool-use capabilities in complex scenarios. Moreover, we find that this limitation usually manifests as a fragment deficiency (i.e., parameter errors) in the response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. The strategy involves (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search, and (2) iteratively pinpointing the model's deficiency by constructing fine-grained preference pairs, then correcting it with preference optimization algorithms. Experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.
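
As a rough illustration of step (2), the sketch below shows how a fine-grained preference pair might isolate the erroneous fragment, and how a standard direct preference optimization (DPO) loss scores such a pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `get_weather` call, the log-probability values, and `beta=0.1` are all hypothetical, and the paper's exact preference objective may differ from plain DPO.

```python
import math


def split_shared_prefix(chosen: str, rejected: str):
    """Split two responses into their shared prefix and diverging fragments.

    Assumption: the good and bad responses agree up to the faulty parameter,
    so the preference signal concentrates on the fragment after the prefix.
    """
    i = 0
    limit = min(len(chosen), len(rejected))
    while i < limit and chosen[i] == rejected[i]:
        i += 1
    return chosen[:i], chosen[i:], rejected[i:]


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy prefers the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical tool calls that differ only in one parameter value.
chosen = "get_weather(city='Paris', unit='celsius')"
rejected = "get_weather(city='Paris', unit='unknown')"
prefix, good_fragment, bad_fragment = split_shared_prefix(chosen, rejected)

# Hypothetical sequence log-probabilities under the policy and reference.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the two responses share a long prefix, only the short diverging fragment distinguishes them, which is what lets the preference signal target the parameter error rather than the whole response.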

Paper Structure (30 sections, 11 equations, 11 figures, 12 tables)

Introduction
Problem Statement and Analysis
Task Overview
Preliminary Study
Method
Warm-up training
MCTS-Based Iterative Reinforcement Learning
Experiments
Experimental Setup
Overall Performance
Ablation Analysis
Module Ablation
Deeper Ablation
Base Model Analysis
Training Gains Analysis
...and 15 more sections

Figures (11)

Figure 1: (a) The training paradigm of the tool-use model under synthetic data. (b) The growth rate of the model’s performance gain declines significantly as the training data increases, especially in complex tool-use scenarios.
Figure 2: An illustration of tool use. Given a user query with candidate tools, LLMs select the tool(s) from the candidates, execute the API call, and finally reply with a response. In the bad response, the parameter errors (e.g., weather='unknown' in red font) account for only a small fragment of the response content.
Figure 3: Error type distribution in bad cases. In bad cases, error types are highly concentrated in Parameter Value & Name.
Figure 4: The overall architecture of iTool, which consists of warm-up training and iterative reinforcement learning. Specifically, after warm-up training ①, the policy model refreshes the replay buffer ② and then actively samples complex data ③. Then, step-wise MCTS ④ is performed to obtain fine-grained preference pairs that point out the wrong fragment in the response. Finally, the models are updated via direct preference optimization ⑤ to improve the response. The fire and frozen icons denote that parameters are updated and fixed, respectively.
Figure 5: The performance progression of easy-to-hard warm-up training on the Live and Overall metrics.
...and 6 more figures
