T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun

Abstract

Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T$^2$PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T$^2$PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T$^2$PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and task performance, along with improved exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
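
The two control rules in the abstract can be read as simple tests over per-token and per-turn signals. The minimal Python sketch below illustrates that reading; the function names, the single-step stall test, and the thresholds `eps` and `delta` are illustrative assumptions, not the paper's published procedure (which aggregates the signal over a sliding window; see Figure 4).

```python
import numpy as np

def token_level_intervention(uncertainty, eps=0.05):
    """Return the index at which to trigger a thinking intervention,
    or None if the marginal uncertainty change never stalls.

    `uncertainty` is a per-token uncertainty trace (e.g. the paper's
    M_t signal); the single-step test and threshold `eps` are
    assumptions, not the paper's published rule."""
    for t in range(1, len(uncertainty)):
        # Marginal change in uncertainty contributed by token t.
        if abs(uncertainty[t] - uncertainty[t - 1]) < eps:
            return t  # exploration has stalled: inject a thinking cue
    return None

def turn_level_resample(turn_progress, delta=0.1):
    """Flag turns whose exploration progress is negligible so the
    rollout engine can resample them instead of wasting budget."""
    return [i for i, p in enumerate(turn_progress) if p < delta]

# Toy usage with synthetic values.
trace = np.array([1.8, 1.2, 0.9, 0.88, 0.87])
print(token_level_intervention(trace))        # -> 3
print(turn_level_resample([0.4, 0.02, 0.3]))  # -> [1]
```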

Paper Structure

This paper contains 43 sections, 17 equations, 7 figures, 8 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Training instability of SOTA baselines under different environment-initialization random seeds. The success rate drops while internal signals such as the KL divergence and gradient norms explode (highlighted with an orange background).
  • Figure 2: Overview of the proposed Uncertainty-Guided Exploration Control at both token and turn levels.
  • Figure 3: Contours of $H_t$ fail to discriminate highly uncertain distributions near uniformity, while $C_t$ ignores variations in tail probabilities. The proposed signal $M_t$ integrates both measures, producing non-degenerate contour geometry that distinguishes distributions sharing identical top-$k$ probability but differing residual mass (see the sketch after this list).
  • Figure 4: (a) Uncertainty dynamics of the self-calibrated signal $M_t$ over the response length. (b) Word cloud of the tokens with the highest uncertainty. (c) Colormap of the uncertainty signal aggregated over a sliding window. When the signal falls below $\epsilon$ (corresponding to the brightest token, 'Then'), the thinking cutoff is triggered.
  • Figure 5: We evaluate both task performance and exploration efficiency. (a) shows that T$^2$PO improves steadily without collapse across three environment seeds. In (b), the bar chart shows that the token consumption of successful trajectories generated by T$^2$PO is substantially lower than that of the SOTA baseline, while the line plot indicates that T$^2$PO's exploration efficiency on successful trajectories is consistently higher. (c) further demonstrates, at the turn level, that T$^2$PO completes tasks while also reducing interaction turns by $\sim25\%$ during training.
  • ...and 2 more figures
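
Figures 3 and 4 together describe the signal pipeline: a per-token entropy $H_t$, a top-$k$ concentration measure $C_t$, a combined signal $M_t$, sliding-window aggregation, and a cutoff once the aggregated signal falls below $\epsilon$. The Python sketch below shows one plausible way to wire these pieces together; the product form chosen for $M_t$, the window width, and the value of $\epsilon$ are assumptions for illustration, since the captions give neither the paper's formula nor its hyperparameters.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H_t of a next-token distribution p."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def topk_mass(p, k=5):
    """Top-k concentration C_t: probability mass of the k largest entries."""
    return float(np.sort(p)[-k:].sum())

def combined_signal(p, k=5):
    """Hypothetical fusion of H_t and C_t into M_t.

    Figure 3 only states that M_t integrates both measures so that
    distributions with identical top-k mass but different residual
    (tail) mass are separated; this product form is an illustrative
    guess, not the published formula."""
    return entropy(p) * (1.0 - topk_mass(p, k))

def windowed_cutoff(signal, window=8, eps=0.02):
    """Aggregate M_t over a sliding window (Figure 4c) and return the
    first position where the aggregate falls below eps, i.e. where the
    thinking cutoff would fire; None if it never does."""
    smoothed = np.convolve(signal, np.ones(window) / window, mode="valid")
    below = np.where(smoothed < eps)[0]
    return int(below[0]) if below.size else None

# Toy trace: distributions that grow sharper (less uncertain) over time.
rng = np.random.default_rng(0)
trace = np.array([combined_signal(rng.dirichlet(np.ones(50) / (t + 1)))
                  for t in range(40)])
print(windowed_cutoff(trace))  # index where the cutoff fires (or None)
```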

Theorems & Definitions (2)

  • Definition 4.1: TTI Rule
  • Definition 4.2: TDS Rule