TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation
Chengrui Huang, Shen Gao, Zhengliang Shi, Dongsheng Wang, Shuo Shang
TL;DR
The paper addresses the problem of coarse-grained alignment in tool-using LLMs by introducing TTPA, a framework that trains models using token-level preferences and an error-oriented signal. It combines Reversed Dataset Construction with Token-level Preference Sampling and an Error-oriented Scoring Mechanism to create high-quality, fine-grained training data and precise rewards for tool calls, enabling more reliable tool usage. Empirical results on ToolBench, BFCL, and a custom test set show that TTPA improves tool selection, parameter filling, and value parsing, while demonstrating strong generalization across models and datasets. The work advances practical tool integration for LLMs and provides a rigorous, tunable approach to token-level tool-use alignment with potential impact on real-world AI assistants and automation pipelines.
Abstract
Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool-call details, which limits preference alignment and error discrimination. To overcome these challenges, we propose the Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm that constructs token-level tool-use preference datasets and aligns LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.
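To make the idea of an error-oriented score concrete, the following is a minimal sketch, not the scoring mechanism defined in the paper. It assumes that a tool call can be compared against a reference call and that errors fall into tool-selection, parameter-filling, and value categories. The `ToolCall` type, the `PENALTIES` weights, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical container for one tool call (not from the paper)."""
    name: str
    arguments: dict = field(default_factory=dict)


# Assumed penalty weights per error type; the paper's actual values are not given here.
PENALTIES = {
    "wrong_tool": 1.0,
    "missing_param": 0.5,
    "extra_param": 0.25,
    "wrong_value": 0.5,
}


def count_errors(predicted: ToolCall, reference: ToolCall) -> dict:
    """Count errors of each type in `predicted` relative to `reference`."""
    errors = {kind: 0 for kind in PENALTIES}
    if predicted.name != reference.name:
        # A wrong tool makes argument comparison meaningless, so stop here.
        errors["wrong_tool"] = 1
        return errors
    for key, ref_value in reference.arguments.items():
        if key not in predicted.arguments:
            errors["missing_param"] += 1
        elif predicted.arguments[key] != ref_value:
            errors["wrong_value"] += 1
    errors["extra_param"] = sum(
        1 for key in predicted.arguments if key not in reference.arguments
    )
    return errors


def error_score(predicted: ToolCall, reference: ToolCall) -> float:
    """Return a score that is 0 for an error-free call and more negative with more errors."""
    errors = count_errors(predicted, reference)
    return -sum(PENALTIES[kind] * n for kind, n in errors.items())


def rank_candidates(candidates: list, reference: ToolCall) -> list:
    """Sort candidate calls from fewest to most weighted errors."""
    return sorted(candidates, key=lambda c: error_score(c, reference), reverse=True)


reference = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
candidates = [
    ToolCall("get_news", {"city": "Paris"}),                        # wrong tool
    ToolCall("get_weather", {"city": "Paris", "unit": "kelvin"}),   # wrong value
    ToolCall("get_weather", {"city": "Paris", "unit": "celsius"}),  # exact match
]
ranked = rank_candidates(candidates, reference)
chosen, rejected = ranked[0], ranked[-1]  # one possible preference pair
```

Under these assumptions, the highest-ranked and lowest-ranked candidates could serve as a chosen/rejected pair for preference training; how TTPA actually builds its pairs at the token level is described in the paper itself.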
