Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning

Yifei Chen, Guanting Dong, Zhicheng Dou

TL;DR

Tool-Light addresses the inefficiencies of Tool-Integrated Reasoning by leveraging information-entropy dynamics to guide data sampling and by employing a two-stage self-evolved training pipeline. It introduces entropy-guided sampling to produce diverse TIR trajectories and a self-evolved Direct Preference Optimization loop with pre-aligned and iterative alignment stages. Across 10 datasets spanning mathematical reasoning and knowledge-intensive tasks, Tool-Light achieves higher tool-use efficiency, reduces unnecessary tool calls, and maintains high answer quality, demonstrating strong generalization to multi-tool settings. The work offers practical data-construction and training strategies to stabilize and optimize tool-based reasoning in real-world settings.

Abstract

Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. However, models employing TIR often display suboptimal behaviors, such as insufficient or excessive tool usage and overthinking after tool calls. How to incentivize LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains an open question. In this paper, we start by exploring the impact of tool calls on model reasoning from the perspective of information entropy. Our findings indicate that tool call results lead to a distinct change in the information entropy of subsequent reasoning, with the overall entropy of the reasoning chain varying based on the number of tool calls. Building on these insights, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework includes dataset construction and multi-stage fine-tuning. For dataset construction, we employ continuous self-evolved sampling using the fine-tuned model, integrating both vanilla sampling and entropy-guided sampling. In addition, we establish strict criteria for selecting positive-negative pairs during sampling. The training process follows a two-stage approach, comprising Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, which significantly improves the model's efficiency in executing TIR tasks.
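
The abstract refers to the information entropy of the reasoning chain without giving a formula here. As a rough illustration only, the sketch below assumes this means the Shannon entropy of the model's next-token distribution at each generation step. The function names and the use of raw logits as input are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution.

    Assumption: this is the per-step quantity being tracked; the paper's
    exact definition may differ.
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_entropies(step_logits):
    """Entropy at every generation step of one trajectory."""
    return [token_entropy(logits) for logits in step_logits]
```

Under this assumption, a uniform distribution over the vocabulary gives the maximum entropy and a sharply peaked one gives a value near zero, so a per-step entropy curve could be inspected around tool-call positions.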

Paper Structure

This paper contains 30 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The entropy distribution of tool-integrated reasoning inference.
  • Figure 2: The overall process of entropy-guided sampling. Gray and red positions represent tool calls and the branching positions, respectively.
  • Figure 4: The Efficiency, Necessity and Sequence Length differences between Tool-Light and baseline methods.
  • Figure 5: Output sequences' entropy distribution under different methods.
  • Figure 6: The variation of metrics with changes in the data ratio.