SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An

TL;DR

SimpleTIR tackles instability in end-to-end reinforcement learning for multi-turn Tool-Integrated Reasoning by filtering out trajectories with void turns to prevent gradient explosions caused by distributional drift from external tool feedback. It embeds a hierarchical MDP with GRPO-based joint policy optimization and feedback masking, enabling stable Zero RL without cold-start data while encouraging diverse reasoning strategies. Empirically, starting from unaligned Qwen bases, SimpleTIR achieves state-of-the-art AIME24 performance and demonstrates robust training dynamics, self-correction, and cross-validation behaviors. This plug-and-play stabilization technique offers a scalable path toward reliable multi-turn LLM agents that reason with external tools.
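
To make the feedback masking mentioned above concrete, here is a minimal sketch. It assumes that feedback masking means excluding tool-output tokens from the loss average, so that only model-generated tokens contribute to the policy update. The helper names `build_feedback_mask` and `masked_mean`, and the `("model" | "tool", n_tokens)` segment format, are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def build_feedback_mask(segments: List[Tuple[str, int]]) -> List[int]:
    """Return 1 for model-generated tokens and 0 for tool-feedback tokens.

    `segments` is a hypothetical format: (source, n_tokens) pairs in
    trajectory order, where source is "model" or "tool".
    """
    mask: List[int] = []
    for source, n_tokens in segments:
        mask.extend([1 if source == "model" else 0] * n_tokens)
    return mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over unmasked (model-generated) tokens only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    count = sum(mask)
    return total / count if count else 0.0


# One trajectory: 3 model tokens, 2 tool-output tokens, 1 model token.
segments = [("model", 3), ("tool", 2), ("model", 1)]
mask = build_feedback_mask(segments)

# The large losses on the tool-output positions are ignored by the mask.
losses = [0.5, 1.0, 1.5, 9.0, 9.0, 1.0]
result = masked_mean(losses, mask)
```

In this toy example the two tool-output positions carry large loss values, but they do not affect `result` because their mask entries are zero.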

Abstract

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
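
The void-turn filter described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes code blocks are delimited by triple-backtick `python` fences and that a final answer appears as `\boxed{...}`. The names `is_void_turn` and `filter_void_trajectories` are hypothetical.

```python
import re
from typing import List

# Build the fence marker programmatically so it does not clash with this listing.
FENCE = "`" * 3

# Assumed conventions (not confirmed by the paper): a complete code block is a
# closed python fence, and a final answer is wrapped in \boxed{...}.
CODE_BLOCK = re.compile(FENCE + r"python\n.*?" + FENCE, re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{.+?\}")


def is_void_turn(response: str) -> bool:
    """A turn is void if it has neither a complete code block nor a final answer."""
    return not (CODE_BLOCK.search(response) or FINAL_ANSWER.search(response))


def filter_void_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Drop every trajectory that contains at least one void turn."""
    return [
        turns for turns in trajectories
        if not any(is_void_turn(turn) for turn in turns)
    ]


good = [
    "Let me compute.\n" + FENCE + "python\nprint(1 + 1)\n" + FENCE,
    "The answer is \\boxed{2}.",
]
# First turn is cut off before the fence closes and gives no answer: a void turn.
bad = [
    "Let me compute.\n" + FENCE + "python\nprint(1 + 1",
    "The answer is \\boxed{2}.",
]

kept = filter_void_trajectories([good, bad])
```

Note that the whole trajectory `bad` is dropped even though its second turn is well formed, which mirrors the trajectory-level filtering the abstract describes.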

Paper Structure

This paper contains 33 sections, 1 theorem, 3 equations, 6 figures, 7 tables.

Key Result

Proposition 3.1

Consider a token $c$ at timestep $t$ of a trajectory $o_i$. The L2 norm of the policy gradient with respect to the logits $\boldsymbol{z}_t$ is given by an expression in which $m_{i,t}$ is the feedback mask, $\rho_{i,t}(\theta)$ is the importance ratio, $|\hat{A}_i|$ is the absolute advantage, $P$ is the policy's probability distribution $\pi_\theta(\cdot|o_{i,<t})$, and $g_{i,t}$ is a gating function active when the PPO […]
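
As a hedged aid to reading the proposition, the following derivation shows where the dependence on $P$ typically comes from. It is a standard softmax identity, assuming $P = \mathrm{softmax}(\boldsymbol{z}_t)$ and writing $\boldsymbol{e}_c$ for the one-hot vector of token $c$ (a symbol introduced here for illustration). It is not a reproduction of the paper's formula.

```latex
\begin{align*}
\nabla_{\boldsymbol{z}_t} \log P(c) &= \boldsymbol{e}_c - P, \\
\left\| \boldsymbol{e}_c - P \right\|_2
  &= \sqrt{\bigl(1 - P(c)\bigr)^2 + \sum_{j \neq c} P(j)^2}
   = \sqrt{1 - 2P(c) + \sum_{j} P(j)^2}.
\end{align*}
```

Under this identity the per-token norm approaches zero as $P(c) \to 1$ and is at least $1$ as $P(c) \to 0$. Assuming the remaining quantities in the proposition enter as scalar multipliers, low-probability tokens combined with a large importance ratio or advantage would then produce the largest gradient contributions.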

Figures (6)

  • Figure 1: Starting from the Qwen2.5-7B base model, the training dynamics of SimpleTIR are highly stable, and it clearly outperforms the baseline method without TIR (DAPO). The gradient norm remains well-behaved with almost no spikes. In contrast, Naive Multi-turn Training not only suffers from unstable dynamics and catastrophic gradient norm explosions, but also fails to match the performance of the baseline without TIR.
  • Figure 2: Training statistics comparing naive single-turn and multi-turn TIR. Single-turn training proceeds smoothly and achieves higher performance, while multi-turn training is unstable.
  • Figure 3: Visualization of token probabilities in a multi-turn TIR trajectory. The y-axis is log-scaled. Distributional drift from tool feedback in early turns leads to a collapse in token probabilities in later turns.
  • Figure 4: An overview of SimpleTIR. During the policy update, SimpleTIR identifies and filters out entire trajectories that contain a void turn—an LLM response that fails to produce either a complete code block or a final answer.
  • Figure 5: Top: Training curves for SimpleTIR with different maximum numbers of turns. The run with a maximum of 10 turns is resumed at step 200 from the run with a maximum of 5 turns. SimpleTIR clearly benefits from scaling interaction turns from 1 to 5. Bottom: Training curves for the ablation studies over the first 320 steps. Filtering trajectories by high importance ratios or by low-probability tokens does not resolve the training instability, while SimpleTIR suffers less from low-probability tokens and gradient explosion.
  • ...and 1 more figure
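
The link between collapsing token probabilities (Figure 3) and large gradient norms can be checked numerically. The sketch below assumes a plain softmax policy over a toy four-token vocabulary. It computes the norm of the log-probability gradient with respect to the logits and compares a likely token against an unlikely one. The function names and the logit values are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits: List[float], c: int) -> float:
    """L2 norm of d log P(c) / d logits, which equals ||one_hot(c) - P||."""
    probs = softmax(logits)
    grad = [(1.0 if j == c else 0.0) - p for j, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


# Toy logits: token 0 is very likely, token 3 is very unlikely.
logits = [4.0, 0.0, 0.0, -4.0]
probs = softmax(logits)

norm_likely = logprob_grad_norm(logits, 0)
norm_unlikely = logprob_grad_norm(logits, 3)

# Closed form for the same norm: sqrt(1 - 2 P(c) + sum_j P(j)^2).
closed_form_unlikely = math.sqrt(1.0 - 2.0 * probs[3] + sum(p * p for p in probs))
```

Sampling the unlikely token yields a much larger per-token gradient norm than sampling the likely one, which is consistent with the instability the figures attribute to low-probability tokens.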

Theorems & Definitions (1)

  • Proposition 3.1