Understanding Tool-Integrated Reasoning
Heng Lin, Zhongwen Xu
TL;DR
This work provides the first formal theory explaining why Tool-Integrated Reasoning (TIR) enhances LLM capabilities, proving that external tools expand both empirical and feasible support beyond pure-text limits. It introduces token efficiency to show that, under a finite token budget, many problem-solving strategies become executable only with tools. To guide tool use without destabilizing training, it presents Advantage Shaping Policy Optimization (ASPO), a stable method that biases the policy toward earlier tool invocation. Empirically, TIR with a Python interpreter markedly surpasses pure-text models on challenging math benchmarks, and emergent cognitive patterns reveal sophisticated ways models combine reasoning with computation. The results support a paradigm shift toward coupling LLMs with efficient tools to unlock powerful, generalizable reasoning capabilities.
Abstract
We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that steers the policy by directly modifying the advantage function. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report that ASPO improves tool-usage behavior, yielding earlier code invocation and many more interactive turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
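To make the pass@k comparison concrete: the abstract does not state how pass@k is estimated, so the following is only a sketch under the assumption that the commonly used unbiased estimator is meant, where n completions are sampled per problem, c of them are correct, and pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its arguments are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    Assumed setup (not confirmed by the source): n completions were sampled
    for one problem and c of them were judged correct.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark score is the per-problem estimate averaged over all problems, for example `mean_pass_at_k([(4, 2), (4, 0)], k=2)`. Comparing the TIR and pure-text models at large k is what probes the capability ceiling the abstract refers to, since large k measures whether a correct solution is reachable at all rather than how often it is sampled.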
