Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization

Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras

TL;DR

This paper tackles the difficulty of training LLMs for multi-turn Tool-Integrated Reasoning (TIR), where traditional trajectory-level rewards yield weak learning signals. It introduces Group Turn Policy Optimization (GTPO), which assigns rewards at the turn level, computes advantages from normalized discounted returns across turns, and densifies sparse outcome rewards with self-supervised reward shaping based on the similarity of generated code. Empirical results on diverse math benchmarks show GTPO achieves a 3.0% relative improvement over GRPO, with notable gains on AIME 2024, MATH 500, and SVAMP, and ablations confirm the importance of turn-level rewards, discounting, and code-based shaping. The results suggest that fine-grained, turn-aware RL signals and self-supervised shaping can improve TIR performance and reduce training stagnation.

Abstract

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
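
The return-based advantage estimation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the discount factor `gamma`, the function names, and the choice to normalize over all turn returns pooled across the rollout group are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def discounted_returns(turn_rewards, gamma=0.9):
    """Discounted return for each turn: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the discounted future rewards.
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_turn_advantages(group_turn_rewards, gamma=0.9, eps=1e-8):
    """Normalize per-turn returns across a group of rollouts (assumed scheme).

    group_turn_rewards: one list of turn-level rewards per trajectory.
    Returns one list of per-turn advantages per trajectory.
    """
    group_returns = [discounted_returns(r, gamma) for r in group_turn_rewards]
    flat = [g for trajectory in group_returns for g in trajectory]
    mu = mean(flat)
    sigma = pstdev(flat)
    return [
        [(g - mu) / (sigma + eps) for g in trajectory]
        for trajectory in group_returns
    ]


if __name__ == "__main__":
    # Hypothetical group: one 3-turn rollout that ends correctly, one 2-turn failure.
    rewards = [[0.0, 0.0, 1.0], [0.0, 0.0]]
    print(group_turn_advantages(rewards, gamma=0.5))
```

In this sketch, turns closer to a rewarded outcome receive larger advantages than earlier turns, and every turn of the failed rollout falls below the group mean.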

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Tool-integrated reasoning (TIR): Given a problem, the model progresses over multiple turns, where each turn consists of: (1) generating textual reasoning, (2) invoking tools (e.g., code), and (3) incorporating tool execution results to refine its understanding. The model repeats this cycle until a termination condition is met, either by producing a final answer or by reaching a predefined stopping criterion.
  • Figure 2: An overview of GTPO: Unlike existing approaches that rely on trajectory-level rewards, GTPO introduces a turn-level reward function that assigns diverse, rule-based rewards for individual turns within each trajectory and performs turn-level return-based discounting for advantage calculation.
  • Figure 3: GTPO reward shaping strategy: In GTPO, each rollout trajectory is partitioned by final outcome (correct vs. incorrect), and the code content is extracted. For each trajectory in the incorrect group, we compute its average similarity against all samples in the correct group and use the similarity score as its partial reward, so that wrong trajectories can still be properly utilized during training for more learning signals.
  • Figure 4: Qualitative example: We show an example AIME24 task to compare the distinct coding patterns of models trained with GRPO and GTPO. Qwen2.5-7B-Instruct trained with GTPO writes correct code along with accurate tests that thoroughly validate the code's correctness, while Qwen2.5-7B-Instruct trained with GRPO fails to solve the problem.
  • Figure 5: Training accuracy curves of GRPO and GTPO under the same experimental setup and training datasets.
  • ...and 3 more figures
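
The reward shaping strategy summarized in the Figure 3 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's method: the similarity metric (`difflib.SequenceMatcher` from the Python standard library), the `partial_scale` factor, and the function names are hypothetical stand-ins, since the captions above do not specify them.

```python
from difflib import SequenceMatcher


def code_similarity(code_a, code_b):
    """Similarity in [0, 1] between two code strings (assumed metric)."""
    return SequenceMatcher(None, code_a, code_b).ratio()


def shaped_rewards(codes, is_correct, partial_scale=0.5):
    """Give incorrect rollouts a partial reward based on code similarity.

    codes: code extracted from each rollout trajectory.
    is_correct: final-outcome flag for each trajectory.
    partial_scale: hypothetical cap on the partial reward, kept below the
        full reward of 1.0 so a wrong answer never outscores a right one.
    """
    correct_codes = [c for c, ok in zip(codes, is_correct) if ok]
    rewards = []
    for code, ok in zip(codes, is_correct):
        if ok:
            rewards.append(1.0)
        elif not correct_codes:
            # No correct rollout in the group, so there is nothing to compare to.
            rewards.append(0.0)
        else:
            # Average similarity against every sample in the correct group.
            avg = sum(code_similarity(code, c) for c in correct_codes)
            avg /= len(correct_codes)
            rewards.append(partial_scale * avg)
    return rewards


if __name__ == "__main__":
    codes = ["print(1+1)", "print(1+1)", "zzz"]
    print(shaped_rewards(codes, [True, False, False]))
```

Under these assumptions, an incorrect trajectory whose code closely resembles the correct group still contributes a nonzero learning signal, while one with unrelated code receives close to zero.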