Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn

Abstract

Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) -- with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.
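
For readers who want a concrete picture of the advantage formulation referenced in the abstract, the sketch below shows one plausible way a GTPO-style hybrid advantage could be computed for a group of rollouts sampled on the same task. The combination A_t = A^T_t + λ·A^O with λ = 0.3, the group-normalization details, and all function names are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def group_normalize(x, eps=1e-8):
    """GRPO-style group-relative normalization: subtract the group mean
    and divide by the group standard deviation."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + eps)

def hybrid_turn_advantages(outcome_rewards, turn_rewards, lam=0.3):
    """Illustrative hybrid advantage for a group of rollouts on the same task:
    each turn gets its per-turn advantage plus a lambda-dampened share of the
    rollout's outcome advantage, A_t = A^T_t + lam * A^O (assumed form).

    outcome_rewards: one final task reward per rollout in the group.
    turn_rewards:    one list of per-turn rewards per rollout.
    """
    a_outcome = group_normalize(outcome_rewards)  # A^O, one value per rollout
    advantages = []
    for i, turns in enumerate(turn_rewards):
        # Simplification: per-turn rewards are normalized within the rollout;
        # a full MT-GRPO implementation would compare matching turns across the group.
        a_turn = group_normalize(turns)
        advantages.append(a_turn + lam * a_outcome[i])
    return advantages

# Toy group of 3 rollouts (rewards are made up for illustration).
outcome = [1.0, 0.0, 0.0]
turns = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0]]
for i, adv in enumerate(hybrid_turn_advantages(outcome, turns)):
    print(f"rollout {i}: turn advantages = {np.round(adv, 3)}")
```

The relevant property, as Figure 1 describes it, is that λ < 1 dampens the outcome advantage so it cannot overwhelm the calibrated per-turn signal; the sketch above is only meant to make that interaction tangible.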

Figures (1)

  • Figure 1: Comparison of reward-to-advantage signal across three approaches, shown for a failing rollout ($R{=}0$) with 5 turns. Top row: per-turn reward values; bottom row: resulting training advantage (green = reinforce, red = suppress, gray = zero gradient). (a) GRPO uses outcome-only reward---all turns get the same uniform advantage, providing no credit assignment. (b) MT-GRPO with naïve dense rewards (e.g., read-only$=$0.3) suffers from advantage misalignment: the outcome advantage $A^O$ overwhelms small per-turn advantages, causing necessary read-only turns to be suppressed (red box). (c) Our IRC method calibrates rewards using discriminative analysis (right panel): read-only gets $r{=}0$ (non-discriminative), focusing gradient entirely on gold actions. GTPO hybrid dampens $A^O$ via $\lambda{=}0.3$, eliminating all advantage mismatches.
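
The discriminative analysis mentioned in the caption's right panel can be pictured as follows: for each candidate per-turn reward event, compare how often it occurs in successful rollouts (outcome reward 1) versus failing ones (outcome reward 0), and zero out rewards for events that occur at similar rates in both groups. The rollout data layout, event names, and threshold in the sketch below are hypothetical, intended only to illustrate how a non-discriminative read-only reward would be calibrated to zero.

```python
from collections import defaultdict

def discriminative_gap(rollouts, event):
    """Difference between the average per-turn occurrence rate of `event`
    in successful rollouts (outcome 1) and failing rollouts (outcome 0)."""
    rates = defaultdict(list)
    for r in rollouts:
        count = sum(1 for turn in r["turns"] if event in turn["events"])
        rates[r["outcome"]].append(count / max(len(r["turns"]), 1))
    succ = sum(rates[1]) / max(len(rates[1]), 1)
    fail = sum(rates[0]) / max(len(rates[0]), 1)
    return succ - fail

def calibrate_rewards(rollouts, candidate_rewards, min_gap=0.1):
    """Keep a per-turn reward only if its event is discriminative, i.e.
    clearly more common in successful than in failing rollouts."""
    calibrated = {}
    for event, reward in candidate_rewards.items():
        gap = discriminative_gap(rollouts, event)
        calibrated[event] = reward if gap >= min_gap else 0.0
    return calibrated

# Hypothetical rollout data: 'read_only_call' occurs in both groups
# (non-discriminative), while 'gold_write_action' appears mostly in successes.
rollouts = [
    {"outcome": 1, "turns": [{"events": {"read_only_call", "gold_write_action"}},
                             {"events": {"gold_write_action"}}]},
    {"outcome": 0, "turns": [{"events": {"read_only_call"}},
                             {"events": set()}]},
]
print(calibrate_rewards(rollouts, {"read_only_call": 0.3, "gold_write_action": 1.0}))
# -> {'read_only_call': 0.0, 'gold_write_action': 1.0}
```

On this toy data the read-only reward is calibrated to zero while the gold-action reward survives, mirroring the outcome shown in panel (c) of Figure 1.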