Understanding Tool-Integrated Reasoning

Heng Lin, Zhongwen Xu

TL;DR

This work provides the first formal theory explaining why Tool-Integrated Reasoning (TIR) enhances LLM capabilities, proving that external tools expand both empirical and feasible support beyond pure-text limits. It introduces token efficiency to show that, under a finite token budget, many problem-solving strategies become executable only with tools. To guide tool use without destabilizing training, it presents Advantage Shaping Policy Optimization (ASPO), a stable method that biases the policy toward earlier tool invocation. Empirically, TIR with a Python interpreter markedly surpasses pure-text models on challenging math benchmarks, and emergent cognitive patterns reveal sophisticated ways models combine reasoning with computation. The results support a paradigm shift toward coupling LLMs with efficient tools to unlock powerful, generalizable reasoning capabilities.

Abstract

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with ASPO: earlier code invocation and many more interactive turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.

Paper Structure

This paper contains 23 sections, 5 theorems, 15 equations, 7 figures, 9 tables.

Key Result

Theorem 3.3

Let $\pi_\theta(y|x)$ be an RLVR-trained policy distribution initialized from a base model with distribution $q(y|x)$. For any prompt $x$, the support of the trained policy is a subset of the support of the base model:

$$\mathrm{supp}(\pi_\theta(\cdot|x)) \subseteq \mathrm{supp}(q(\cdot|x)).$$

This implies that if $q(y^*|x) = 0$ for a correct trajectory $y^*$, then RLVR can never discover $y^*$.
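
The intuition behind this theorem can be illustrated with a toy example. The sketch below is not the paper's training procedure: it assumes, purely for illustration, that the trained policy is an exponential reward reweighting of the base distribution over three trajectories. The function names `reweight` and `support`, the parameter `beta`, and all the numbers are hypothetical.

```python
import math


def reweight(q, rewards, beta=1.0):
    """Toy stand-in for an RL update (an assumption, not the paper's method):
    scale each base probability by exp(beta * reward), then renormalize."""
    weights = [p * math.exp(beta * r) for p, r in zip(q, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def support(dist):
    """Indices of trajectories with strictly positive probability."""
    return {i for i, p in enumerate(dist) if p > 0}


# Hypothetical base model: trajectory 2 is the only correct one,
# but the base model assigns it zero probability.
q = [0.7, 0.3, 0.0]
rewards = [0.0, 0.0, 1.0]

pi = reweight(q, rewards, beta=10.0)

# Multiplying a zero probability by any finite weight leaves it at zero,
# so the reweighted policy cannot reach the correct trajectory.
print(support(pi) <= support(q))
print(pi[2])
```

However large `beta` is made, the correct trajectory stays at probability zero because the reweighting only redistributes mass the base model already has. This is the limitation that tool integration is meant to lift.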

Figures (7)

  • Figure 1: Training (a) and testing (b) accuracy of TIR and pure-text RL on the Qwen3-8B model. The AIME25 accuracy in (b) is the average over 16 responses.
  • Figure 2: Pass@$k$ curves for the TIR (RL-trained) and pure-text models (Qwen3-8B) across three benchmarks: (a) AIME24, (b) AIME25, and (c) Omni-MATH-512. The detailed numerical data corresponding to this figure are provided in the Appendix.
  • Figure 3: The flow of problem solvability on Omni-MATH-512 when transitioning from the pure-text model to the TIR model, evaluated at $k=256$.
  • Figure 4: (a)-(e) Pass@k curves for the TIR and pure-text models, grouped by problem algo friendliness. (f) The distribution of algo friendliness scores across the Omni-MATH-512 dataset. The problems are categorized into five groups based on their algo friendliness scores: 1.0–1.5 (G1), 2.0–2.5 (G2), 3.0–3.5 (G3), 4.0–4.5 (G4), and 5.0 (G5).
  • Figure 5: The (a) training and (b) testing accuracy of the baseline and ASPO algorithm.
  • ...and 2 more figures
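
Several of the figures above report pass@$k$. The source does not say which estimator is used, so the following is a minimal sketch that assumes the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ sampled responses of which $c$ are correct. The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k responses is correct,
    given n sampled responses of which c are correct.

    Assumes the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k responses are incorrect,
    # which makes the estimate exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct.
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 2))  # 0.5
```

Averaging this quantity over all problems in a benchmark for each value of $k$ gives one point on a pass@$k$ curve.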

Theorems & Definitions (11)

  • Definition 3.1: Support of a Model (adapted from wu2025invisible)
  • Definition 3.2: Empirical Support (from wu2025invisible)
  • Theorem 3.3: Support Preservation under RLVR (from wu2025invisible)
  • Theorem 3.4: Strict Expansion of Empirical Support via Tool Integration
  • proof
  • Definition 3.5: Computational Equivalence Class
  • Definition 3.6: Feasible Support under Budget $B$
  • Theorem 3.7: Strict Supremacy of Tool-Integrated Feasible Support
  • proof
  • Proposition A.1: Informal; External State as Unbounded Scratchpad
  • ...and 1 more