
TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation

Chengrui Huang, Shen Gao, Zhengliang Shi, Dongsheng Wang, Shuo Shang

TL;DR

The paper addresses the problem of coarse-grained alignment in tool-using LLMs by introducing TTPA, a framework that trains models using token-level preferences and an error-oriented signal. It combines Reversed Dataset Construction with Token-level Preference Sampling and an Error-oriented Scoring Mechanism to create high-quality, fine-grained training data and precise rewards for tool calls, enabling more reliable tool usage. Empirical results on ToolBench, BFCL, and a custom test set show that TTPA improves tool selection, parameter filling, and value parsing, while demonstrating strong generalization across models and datasets. The work advances practical tool integration for LLMs and provides a rigorous, tunable approach to token-level tool-use alignment with potential impact on real-world AI assistants and automation pipelines.

Abstract

Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool-call details, which limits preference alignment and error discrimination. To overcome these challenges, we propose the Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm that constructs token-level tool-use preference datasets and aligns LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.

Paper Structure

This paper contains 35 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The overall framework of our work, which mainly consists of Preference Oriented Tool-use Dataset Construction and Error-oriented Scoring Mechanism.
  • Figure 2: Error types of tool calls. The Example column gives an example of each error type, and the Reason column explains why that example failed.
  • Figure 3: The results of evaluation on the general datasets.
  • Figure 4: A case study on BFCL. TTPA (Qwen) passes the question but is evaluated as false.
  • Figure 5: A complete example of the entire process.